The martingale method is used to establish concentration inequalities
for a class of dependent random sequences on a countable
state space, with the constants in the inequalities expressed in terms
of certain mixing coefficients. Along the way, bounds are obtained on
martingale differences associated with the random sequences, which
may be of independent interest. As applications of the main result,
concentration inequalities are also derived for inhomogeneous Markov
chains and hidden Markov chains, and an extremal property associated
with their martingale difference bounds is established. This
work complements and generalizes certain concentration inequalities
obtained by Marton and Samson, while also providing different proofs
of some known results.

1. Introduction.

1.1. Background. Concentration of measure is a fairly general phenomenon
which, roughly speaking, asserts that a function $\varphi : \Omega \to \mathbb{R}$ with "suitably
small" local oscillations defined on a "high-dimensional" probability space
$(\Omega, \mathcal{F}, \mathbb{P})$ almost always takes values that are "close" to the average (or
median) value of $\varphi$ on $\Omega$. Under various assumptions on the function $\varphi$
and different choices of metrics, this phenomenon has been quite extensively
studied in the case when $\mathbb{P}$ is a product measure on a product space $(\Omega, \mathcal{F})$
or, equivalently, when $\varphi$ is a function of a large number of i.i.d. random
variables (see, e.g., the surveys of Talagrand [29, 30], Ledoux [17], McDiarmid
[25] and references therein). Concentration inequalities have found
numerous applications in a variety of fields (see, e.g., [6, 25, 28]).

The situation is naturally far more complex for nonproduct measures,
where one can trivially construct examples where the concentration property
fails. For functions of dependent random variables $(X_i)_{i \in \mathbb{N}}$, the crux
of the problem is often to quantify and bound the dependence among the
random variables $X_i$ in terms of various types of mixing coefficients. A sufficiently
rapid decay of the mixing coefficients often allows one to establish
concentration results [22, 23, 27].

A number of techniques have been used to prove measure concentration. 
Among these are isoperimetric inequalities and the induction method of Ta- 
lagrand [29, 30], log-Sobolev inequalities developed by Ledoux and others 
[5, 16, 23, 27], information-theoretic techniques [1, 9, 12, 19, 20, 21, 27], 
martingale methods based on the Azuma-Hoeffding inequality [2, 8, 11, 25], 
transportation inequalities (see, e.g., [10, 26]), and Stein's method of ex- 
changeable pairs, recently employed by Chatterjee [7]. The information- 
theoretic approach has proved quite useful for dealing with nonproduct 
measures. In a series of papers, Marton [19, 20, 21, 22, 23] successfully 
used these techniques to establish concentration inequalities for collections 
of dependent random variables under various assumptions. In this work we 
adopt a completely different approach, based on the martingale method, to 
establish concentration bounds for dependent random variables. In the pro- 
cess we establish bounds on certain martingale differences, which may be of 
independent interest. 

In the next subsection we provide a precise description of our main results 
and discuss their relation to prior work. The subsequent subsections provide 
an outline of the paper and collect some common notation that we use. 

1.2. Description of main results. Consider a collection of random variables
$(X_i)_{1 \le i \le n}$ taking values in a countable space $\mathcal{S}$. Let $\mathcal{F}$ be the set of
all subsets of $\mathcal{S}^n$ and let $\mathbb{P}$ be the probability distribution induced by the
finite sequence $X = (X_1, \ldots, X_n)$ on $(\mathcal{S}^n, \mathcal{F})$. Then we can (and will) assume
without loss of generality that $X_i$, $1 \le i \le n$, are the coordinate projections
defined on the probability space $(\mathcal{S}^n, \mathcal{F}, \mathbb{P})$. Given $1 \le i < j \le n$, $x_i^j$ is used
to denote the subsequence $(x_i, x_{i+1}, \ldots, x_j)$. Similarly, for $1 \le i < j \le n$, $X_i^j$
represents the random vector $(X_i, \ldots, X_j)$. For further simplicity of notation,
$x_1^j$ and $X_1^j$ will sometimes be written simply as $x^j$ and $X^j$, respectively. Let
$\mathcal{S}^n$ be equipped with the Hamming metric $d : \mathcal{S}^n \times \mathcal{S}^n \to [0, \infty)$, defined by

$d(x, y) = \sum_{i=1}^{n} 1_{\{x_i \ne y_i\}}.$

Also, let $\bar{d}(x, y) = d(x, y)/n$ denote the normalized Hamming metric on $\mathcal{S}^n$.
In addition, let $\mathbb{E}$ denote expectation with respect to $\mathbb{P}$. Given a function
$\varphi : \mathcal{S}^n \to \mathbb{R}$, we will use the shorthand notation $\mathbb{P}\{|\varphi - \mathbb{E}\varphi| \ge t\}$ instead
of $\mathbb{P}\{|\varphi(X) - \mathbb{E}[\varphi(X)]| \ge t\}$. Also, given two random variables $Y$ and $Z$,
$\mathcal{L}(Z \mid Y = y)$ denotes the conditional distribution of $Z$ given $Y = y$.
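The two metrics just defined are easy to experiment with on small examples. The following is a minimal Python sketch; the function names `hamming`, `normalized_hamming` and `is_lipschitz` are illustrative choices of ours, not notation from the paper, and sequences are assumed to be tuples of equal length.

```python
from itertools import product

def hamming(x, y):
    """Hamming metric d(x, y): number of coordinates where x and y differ."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(x, y) if a != b)

def normalized_hamming(x, y):
    """Normalized Hamming metric: d(x, y) / n."""
    return hamming(x, y) / len(x)

def is_lipschitz(phi, points, c, metric=hamming):
    """Brute-force check that |phi(x) - phi(y)| <= c * metric(x, y) on `points`."""
    return all(abs(phi(x) - phi(y)) <= c * metric(x, y) + 1e-12
               for x in points for y in points)

# Example: phi(x) = number of ones on the binary cube {0, 1}^3.
cube = list(product((0, 1), repeat=3))
```

On this example, `sum` is 1-Lipschitz for the Hamming metric but only 3-Lipschitz for the normalized metric, which matches the general fact that a $c$-Lipschitz function for $\bar{d}$ is $c/n$-Lipschitz for $d$.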

Our main result is a concentration inequality on the metric probability
space $(\mathcal{S}^n, d, \mathbb{P})$, which is expressed in terms of the following mixing coefficients.
For $1 \le i < j \le n$, define

(1.1)    $\bar{\eta}_{ij} = \sup_{\substack{y^{i-1} \in \mathcal{S}^{i-1},\ w, \hat{w} \in \mathcal{S} \\ \mathbb{P}(X^i = y^{i-1}w) > 0,\ \mathbb{P}(X^i = y^{i-1}\hat{w}) > 0}} \eta_{ij}(y^{i-1}, w, \hat{w}),$

where, for $y^{i-1} \in \mathcal{S}^{i-1}$ and $w, \hat{w} \in \mathcal{S}$,

(1.2)    $\eta_{ij}(y^{i-1}, w, \hat{w}) = \| \mathcal{L}(X_j^n \mid X^i = y^{i-1}w) - \mathcal{L}(X_j^n \mid X^i = y^{i-1}\hat{w}) \|_{TV}$

and $\|Q - R\|_{TV}$ denotes the total variation distance between the probability
measures $Q$ and $R$ [see (1.17) for a precise definition]. Moreover, let $\Delta_n$ be
the $n \times n$ upper triangular matrix defined by

$(\Delta_n)_{ij} = \begin{cases} 1, & \text{if } i = j, \\ \bar{\eta}_{ij}, & \text{if } i < j, \\ 0, & \text{otherwise.} \end{cases}$

Observe that the (usual $\ell_\infty$) operator norm of the matrix $\Delta_n$ is given explicitly by

(1.3)    $\|\Delta_n\|_\infty = \max_{1 \le i \le n} H_{n,i},$

where, for $1 \le i \le n - 1$,

(1.4)    $H_{n,i} = (1 + \bar{\eta}_{i,i+1} + \cdots + \bar{\eta}_{i,n})$

and $H_{n,n} = 1$.
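To make (1.3) and (1.4) concrete, the sketch below builds the upper triangular matrix from a table of mixing coefficients and computes its $\ell_\infty$ operator norm as the maximum row sum. The helper names and the numerical coefficients are hypothetical, chosen only for illustration.

```python
def delta_matrix(eta_bar, n):
    """Build Delta_n from a dict {(i, j): eta_bar_ij} with 1-based indices i < j."""
    return [[1.0 if i == j else (eta_bar[(i, j)] if i < j else 0.0)
             for j in range(1, n + 1)]
            for i in range(1, n + 1)]

def operator_norm_inf(matrix):
    """l_infinity operator norm of a matrix: the maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in matrix)

# Hypothetical mixing coefficients for n = 3 (illustration only).
eta_bar = {(1, 2): 0.5, (1, 3): 0.25, (2, 3): 0.5}
delta = delta_matrix(eta_bar, 3)
norm = operator_norm_inf(delta)  # equals max(H_{3,1}, H_{3,2}, H_{3,3})
```

Here the row sums are $H_{3,1} = 1.75$, $H_{3,2} = 1.5$ and $H_{3,3} = 1$, so the norm is attained in the first row.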

We can now state the concentration result. 



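Theorem 1.1. Suppose $\mathcal{S}$ is a countable space, $\mathcal{F}$ is the set of all subsets
of $\mathcal{S}^n$, $\mathbb{P}$ is a probability measure on $(\mathcal{S}^n, \mathcal{F})$ and $\varphi : \mathcal{S}^n \to \mathbb{R}$ is a $c$-Lipschitz
function (with respect to the Hamming metric) on $\mathcal{S}^n$ for some $c > 0$. Then
for any $t > 0$,

(1.5)    $\mathbb{P}\{|\varphi - \mathbb{E}\varphi| \ge t\} \le 2\exp\left(-\frac{t^2}{2nc^2\|\Delta_n\|_\infty^2}\right).$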

Theorem 1.1 follows from Theorem 2.1 and Remark 2.1. For the particular
case when $(X_1, \ldots, X_n)$ is a (possibly inhomogeneous) Markov chain, the
bound in Theorem 1.1 simplifies further. More precisely, given any initial
probability distribution $p_0(\cdot)$ and stochastic transition kernels $p_i(\cdot \mid \cdot)$,
$1 \le i \le n - 1$, let the probability measure $\mathbb{P}$ on $\mathcal{S}^n$ be defined by

(1.6)    $\mathbb{P}\{(X_1, \ldots, X_i) = x\} = p_0(x_1) \prod_{j=1}^{i-1} p_j(x_{j+1} \mid x_j)$

for every $1 \le i \le n$ and every $x = (x_1, \ldots, x_i) \in \mathcal{S}^i$. Moreover, let $\theta_i$ be the
$i$th contraction coefficient of the Markov chain:

(1.7)    $\theta_i = \sup_{x', x'' \in \mathcal{S}} \| p_i(\cdot \mid x') - p_i(\cdot \mid x'') \|_{TV}$

for $1 \le i \le n - 1$, and set

(1.8)    $M_n = \max_{1 \le i \le n-1} (1 + \theta_i + \theta_i\theta_{i+1} + \cdots + \theta_i \cdots \theta_{n-1}).$
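For a finite state space, the contraction coefficients (1.7) and the constant $M_n$ of (1.8) can be computed directly. The sketch below assumes each kernel is given as a list of rows (one probability vector per current state); the function names and the example kernel are hypothetical.

```python
def tv_distance(p, q):
    """Total variation distance between two probability vectors (half the l1 distance)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def contraction_coefficient(kernel):
    """theta as in (1.7): the largest TV distance between two rows of the kernel."""
    return max(tv_distance(r1, r2) for r1 in kernel for r2 in kernel)

def markov_M(thetas):
    """M_n as in (1.8) for thetas = [theta_1, ..., theta_{n-1}].

    Every term in the maximum is at least 1, so starting from 1.0 does not
    change the result (for an empty list we return 1.0 by convention).
    """
    best = 1.0
    for i in range(len(thetas)):
        total, running_product = 1.0, 1.0
        for theta in thetas[i:]:
            running_product *= theta
            total += running_product
        best = max(best, total)
    return best

# Hypothetical homogeneous two-state kernel, used at every step of a chain with n = 4.
kernel = [[0.75, 0.25], [0.25, 0.75]]
theta = contraction_coefficient(kernel)   # 0.5
M4 = markov_M([theta] * 3)                # 1 + 0.5 + 0.25 + 0.125
```

In this example all $\theta_i$ equal $1/2$, and the computed value stays below the geometric-series limit $1/(1 - \theta) = 2$.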

Then we have the following result. 

Theorem 1.2. Suppose $\mathbb{P}$ is the Markov measure on $\mathcal{S}^n$ described in
(1.6), and $\varphi : \mathcal{S}^n \to \mathbb{R}$ is a $c$-Lipschitz function with respect to the Hamming
metric on $\mathcal{S}^n$ for some $c > 0$. Then for any $t > 0$,

(1.9)    $\mathbb{P}\{|\varphi - \mathbb{E}\varphi| \ge t\} \le 2\exp\left(-\frac{t^2}{2nc^2M_n^2}\right),$

where $M_n$ is given by (1.8). In addition, if $\varphi$ is $c$-Lipschitz with respect to
the normalized Hamming metric, then

(1.10)    $\mathbb{P}\{|\varphi - \mathbb{E}\varphi| \ge t\} \le 2\exp\left(-\frac{nt^2}{2c^2M_n^2}\right).$

In particular, when

(1.11)    $M = \sup_n \frac{M_n^2}{n} < \infty,$

the concentration bound (1.10) is dimension-independent.

Theorem 1.2 follows from Theorem 1.1 and the observation (proved in
Lemma 7.1) that $\|\Delta_n\|_\infty \le M_n$ for any Markov measure $\mathbb{P}$. In the special
case when $\mathbb{P}$ is a uniformly contracting Markov measure, satisfying $\theta_i \le \theta < 1$
for $1 \le i \le n - 1$, we have $M_n \le 1/(1 - \theta)$ for every $n$ and the dimension-independence
condition (1.11) of Theorem 1.2 also holds trivially. The constants
in the exponent of the upper bounds in (1.5) and (1.9) are not sharp;
indeed, for the independent case (i.e., when $\mathbb{P}$ is a product measure) it is
known that a sharper bound can be obtained by replacing $n/2$ by $2n$ (see
[24] and [29]).

Another application of our technique yields a novel concentration inequality
for hidden Markov chains. Rather than state the (natural but somewhat
long-winded) definitions here, we defer them to Section 7. The key result
(Theorem 7.1) is that the mixing coefficients of a hidden Markov chain
can be entirely controlled by the contraction coefficients of the underlying
Markov chain, and so the concentration bound has the same form as (1.10).
The latter result is at least somewhat surprising, as this relationship fails
for arbitrary hidden-observed process pairs.

For the purpose of comparison with prior results, it will be useful to derive
some simple consequences of Theorem 1.1. If $\varphi$ is a 1-Lipschitz function
with respect to the normalized Hamming metric, then it is a $1/n$-Lipschitz
function with respect to the regular Hamming metric, and so it follows from
Theorem 1.1 (with $c = 1/n$) that

(1.12)    $\mathbb{P}\{|\varphi - \mathbb{E}\varphi| \ge t\} \le \alpha(t),$

where, for $t > 0$,

(1.13)    $\alpha(t) = 2\exp\left(-\frac{nt^2}{2\|\Delta_n\|_\infty^2}\right).$

In turn, this immediately implies the following concentration inequality for
any median $m_\varphi$ of $\varphi$: for $t > \|\Delta_n\|_\infty \sqrt{(2\ln 4)/n}$,

(1.14)    $\mathbb{P}\{|\varphi(X) - m_\varphi| \ge t\} \le 2\exp\left(-\frac{n}{2}\left(\frac{t}{\|\Delta_n\|_\infty} - \sqrt{\frac{2\ln 4}{n}}\right)^2\right).$

Indeed, this is a direct consequence of the fact (stated as Proposition 1.8
of [17]) that if (1.12) holds, then the left-hand side of (1.14) is bounded by
$\alpha(t - t_0)$, where $t_0 = \alpha^{-1}(1/2)$. In addition (see, e.g., Proposition 3 of [17]),
this also shows that $\alpha_{\mathbb{P}}(\cdot) = \alpha(\cdot - t_0)/2$ acts as a concentration function for
the measure $\mathbb{P}$ on $\mathcal{S}^n$ equipped with the normalized Hamming metric: in
other words, for any set $A \subset \mathcal{S}^n$ with $\mathbb{P}(A) \ge 1/2$ and $t > t_0$,

(1.15)    $\mathbb{P}(A_t) \ge 1 - \tfrac{1}{2}\alpha(t - t_0),$

where $A_t = \{y \in \mathcal{S}^n : \bar{d}(y, x) < t \text{ for some } x \in A\}$ is the $t$-fattening of $A$ with
respect to the normalized Hamming metric.
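The quantities in this passage can be evaluated numerically. The sketch below assumes the tail function takes the form $\alpha(t) = 2\exp(-nt^2/(2\|\Delta_n\|_\infty^2))$, which is what Theorem 1.1 gives for $c = 1/n$; the function names and the sample values of $n$ and $\|\Delta_n\|_\infty$ are ours.

```python
import math

def alpha(t, n, delta_norm):
    """Tail bound around the mean for a function 1-Lipschitz in the normalized metric."""
    return 2.0 * math.exp(-n * t * t / (2.0 * delta_norm ** 2))

def t_zero(n, delta_norm):
    """t_0 = alpha^{-1}(1/2), i.e. delta_norm * sqrt(2 ln 4 / n)."""
    return delta_norm * math.sqrt(2.0 * math.log(4.0) / n)

def median_bound(t, n, delta_norm):
    """Bound alpha(t - t_0) on deviations from a median, stated for t > t_0."""
    t0 = t_zero(n, delta_norm)
    if t <= t0:
        raise ValueError("the median bound is stated only for t > t_0")
    return alpha(t - t0, n, delta_norm)

# Hypothetical values: n = 100 coordinates and operator norm 1.75.
t0 = t_zero(100, 1.75)
```

By construction `alpha(t0, ...)` equals 1/2, and the median bound is always weaker than the mean bound at the same $t$ because of the shift by $t_0$.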

The above theorems complement the results of Marton [20, 21, 22] and 
Samson [27]. Theorem 1 of Marton [22] (combined with Lemma 1 of [21] and 
the comment after Proposition 4 of [20]) shows that when S is a complete, 
separable metric space, equipped with the normalized Hamming distance $\bar{d}$,
for any Lipschitz function with $\|\varphi\|_{Lip} \le 1$, the relation (1.15) holds with
$t_0 = C\sqrt{(\ln 2)/2n}$ and

a(t) = 2exp 
where

$C = \max_{1 \le i \le n}\ \sup_{y \in \mathcal{S}^{i-1},\ w, \hat{w} \in \mathcal{S}}\ \inf_{\pi \in M_i(y, w, \hat{w})} \mathbb{E}_\pi[d(X^n, \hat{X}^n)],$

with $M_i(y, w, \hat{w})$ being the set of probability measures $\pi = \mathcal{L}(X^n, \hat{X}^n)$ on
$\mathcal{S}^n \times \mathcal{S}^n$ whose marginals are $\mathcal{L}(X^n \mid X^i = yw)$ and $\mathcal{L}(X^n \mid X^i = y\hat{w})$, respectively.
Moreover, concentration inequalities around the median that are
qualitatively similar to the one obtained in Theorem 1.2 were obtained for
strictly contracting Markov chains in [20] (see Proposition 1) and for a class
of stationary Markov processes in [21] (see Proposition 4'). On the other
hand, our result in Theorem 1.2 is applicable to a broader class of Markov
chains, which could be nonstationary and not necessarily uniformly contracting.

The mixing coefficients $\bar{\eta}_{ij}$ defined in (1.1) also arise in the work of Samson,
who derived concentration bounds for dependent random variables, but
in a different space with a different metric, and for a more restrictive class
of functions. Specifically, for the case when $\mathcal{S} = [0, 1]$, equipped with the
Euclidean metric, it was shown in Samson [27] that if a function $\varphi : \mathcal{S}^n \to \mathbb{R}$
is convex with Lipschitz constant $\|\varphi\|_{Lip} \le 1$, then



where $\|\Gamma\|_2$ is the usual $\ell_2$ operator norm of the upper-triangular $n \times n$
matrix $\Gamma$ of the form



The results of both Marton and Samson cited above were obtained using 
a combination of information-theoretic and coupling techniques, as well as 
the duality method of [4]. In contrast, in this paper we adopt a completely 
different approach, based on the martingale method (described in Section 
2) and a linear algebraic perspective, thus also providing alternative proofs 
of some known results. 

The concentration inequality in Theorem 1.1 was obtained almost contemporaneously
with the publication of [8], whose coupling matrix $D^{\sigma}$ is a
close analogue of our $\Delta_n$. We derive essentially the same martingale difference
bound as Chazottes et al. by a rather different method: they employ
a coupling argument while we rely on the linear programming inequality in
Theorem 4.1. The latter is proved in greater generality (for weighted Hamming
metrics), and in a much simpler way, in Kontorovich's Ph.D. thesis [15].







1.3. Outline of paper. In Section 1.4 we summarize some basic notation
used throughout the paper. In Section 2, we set up the basic machinery for
applying the martingale method, or equivalently the method of bounded
differences, to our problem. In particular, as stated precisely in Theorem
2.1, this reduces the proof of Theorem 1.1 to showing that certain martingale
differences $V_i(\varphi; y)$, $1 \le i \le n - 1$, associated with the measure $\mathbb{P}$ on
$\mathcal{S}^n$ are uniformly bounded by the $\ell_\infty$ operator norm of the matrix $\Delta_n$ of
mixing coefficients. Sections 3-5 are devoted to establishing these bounds
when $\mathcal{S}$ is finite. The proof uses linear algebraic techniques and establishes
a functional inequality that may be of independent interest. Section 6 then
uses an approximation argument to extend the bounds to the case when $\mathcal{S}$
is countable. As applications of the main result, Section 7.1 considers the
case when $\mathbb{P}$ is a (possibly inhomogeneous) Markov measure, and Section
7.2 performs a similar calculation for measures induced by hidden Markov
chains. Specifically, Lemma 7.1 establishes the bound $\|\Delta_n\|_\infty \le M_n$, which
then allows Theorem 1.2 to be deduced immediately from Theorem 1.1. Finally,
in Section 7.3 we describe a class of extremal functions for martingale
differences associated with Markov measures. Some lemmas not central to
the paper are collected in the Appendix.

1.4. Notation and definitions. In addition to the notation introduced in
Section 1.2, we shall use the following common notation throughout the
paper. Given a finite or countable set $\mathcal{S}$ and finite sequences $x \in \mathcal{S}^k$ and
$y \in \mathcal{S}^\ell$, we use either $xy \in \mathcal{S}^{k+\ell}$ or $[x\ y] \in \mathcal{S}^{k+\ell}$ to denote concatenation of
the two sequences. The space $\mathcal{S}^0$ represents the null string. Also, we will
use the shorthand notation $\sum_{x_i^j}$ to mean $\sum_{x_i^j \in \mathcal{S}^{j-i+1}}$. The random variables
$X = (X_1, \ldots, X_n)$ defined on $(\mathcal{S}^n, \mathcal{F}, \mathbb{P})$ will always represent coordinate projections:
$X_i(x) = x_i$. Therefore for conciseness, we will sometimes simply
write $\mathbb{P}(x)$ and $\mathbb{P}(x_j^n \mid x^i)$ to denote $\mathbb{P}\{X = x\}$ and $\mathbb{P}\{X_j^n = x_j^n \mid X^i = x^i\}$,
respectively. As in (1.1), we will always assume (often without explicit mention)
that terms involving conditional probabilities are restricted to elements
for which the probability of the conditioned event is strictly positive.

The indicator variable $1_{\{\cdot\}}$ takes on the value 1 if the predicate in the
bracket $\{\cdot\}$ is true, and 0 otherwise. The sign function is defined by $\mathrm{sgn}(z) =
1_{\{z > 0\}} - 1_{\{z < 0\}}$ and the positive function is defined by $(z)_+ = \max(z, 0) =
z 1_{\{z > 0\}}$. We use the standard convention that $\sum_{z \in A} z = 0$ and $\prod_{z \in A} z = 1$
whenever $A$ is empty ($A = \emptyset$).

Throughout the paper, $K_n$ denotes the space of all functions $\kappa : \mathcal{S}^n \to \mathbb{R}$
(for finite $\mathcal{S}$) and $\Phi_n \subset K_n$ the subset of 1-Lipschitz functions $\varphi : \mathcal{S}^n \to [0, n]$.

For a discrete, signed measure space $(\mathcal{X}, \mathcal{B}, \nu)$, recall that the $\ell_1$ norm is
given by

(1.16)    $\|\nu\|_1 = \sum_{x \in \mathcal{X}} |\nu(x)|.$

Given two probability measures $\nu_1, \nu_2$ on the measurable space $(\mathcal{X}, \mathcal{B})$, we
define the total variation distance between the two measures as follows:

(1.17)    $\|\nu_1 - \nu_2\|_{TV} = \sup_{A \in \mathcal{B}} |\nu_1(A) - \nu_2(A)|$

(which, by common convention, is equal to half the total variation of the
signed measure $\nu_1 - \nu_2$). It is easy to see that in this case,

(1.18)    $\|\nu_1 - \nu_2\|_{TV} = \tfrac{1}{2}\|\nu_1 - \nu_2\|_1 = \sum_{x \in \mathcal{X}} (\nu_1(x) - \nu_2(x))_+.$
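The three expressions for the total variation distance in (1.17) and (1.18) can be compared directly on a small finite space. This is a brute-force sketch (the supremum is taken over all subsets, so it is only practical for a handful of points); the function names and the two example distributions are hypothetical.

```python
from itertools import combinations

def tv_sup(p, q):
    """(1.17): supremum over all subsets A of |p(A) - q(A)| (brute force)."""
    idx = range(len(p))
    return max(abs(sum(p[i] - q[i] for i in A))
               for r in range(len(p) + 1)
               for A in combinations(idx, r))

def tv_half_l1(p, q):
    """Half the l1 distance between p and q."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def tv_positive_part(p, q):
    """Sum of the positive parts of p(x) - q(x)."""
    return sum(max(a - b, 0.0) for a, b in zip(p, q))

# Two hypothetical distributions on a three-point space.
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
```

All three functions return 0.25 for this pair, in line with the chain of equalities in (1.18).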

2. Method of bounded martingale differences. Since our proof relies on
the so-called martingale method of establishing concentration inequalities,
here we briefly introduce the method (see [24, 25] or Section 4.1 of [17] for
more thorough treatments). Let $X = (X_i)_{1 \le i \le n}$ be a collection of random
variables defined on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, taking values in a space
$\mathcal{S}$. Then given any filtration of sub-$\sigma$-algebras,

$\{\emptyset, \Omega\} = \mathcal{F}_0 \subset \mathcal{F}_1 \subset \cdots \subset \mathcal{F}_n = \mathcal{F},$

and a function $\varphi : \mathcal{S}^n \to \mathbb{R}$, define the associated martingale differences by

(2.1)    $V_i(\varphi) = \mathbb{E}[\varphi(X) \mid \mathcal{F}_i] - \mathbb{E}[\varphi(X) \mid \mathcal{F}_{i-1}]$

for $i = 1, \ldots, n$. It is a classical result, going back to Hoeffding [11] and
Azuma [2], that

(2.2)    $\mathbb{P}\{|\varphi - \mathbb{E}\varphi| \ge r\} \le 2\exp(-r^2/2D^2)$

for any $D$ such that $D^2 \ge \sum_{i=1}^{n} \|V_i(\varphi)\|_\infty^2$.
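As a small numerical companion to (2.2), the following sketch evaluates the Azuma-Hoeffding bound from a list of sup-norm bounds on the martingale differences. The function name is ours, and the inputs in the example are hypothetical.

```python
import math

def azuma_hoeffding_bound(diff_bounds, r):
    """Evaluate 2 exp(-r^2 / (2 D^2)) with D^2 the sum of squared bounds on ||V_i||."""
    d_squared = sum(b * b for b in diff_bounds)
    if d_squared <= 0:
        raise ValueError("at least one martingale difference bound must be positive")
    return 2.0 * math.exp(-r * r / (2.0 * d_squared))

# Hypothetical example: four martingale differences, each bounded by 1.
bound = azuma_hoeffding_bound([1.0, 1.0, 1.0, 1.0], r=2.0)  # 2 exp(-1/2)
```

Note that the bound is only informative once it drops below 1, which here requires $r^2 > 2D^2 \ln 2$.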

In the setting of this paper, we have $(\Omega, \mathcal{F}, \mathbb{P}) = (\mathcal{S}^n, \mathcal{F}, \mathbb{P})$, where $\mathcal{S}$ is a
countable set, $\mathcal{F}$ is the set of all subsets of $\mathcal{S}^n$ and $X = (X_i)_{1 \le i \le n}$ is the
collection of coordinate projections. We set $\mathcal{F}_0 = \{\emptyset, \mathcal{S}^n\}$,
$\mathcal{F}_n = \mathcal{F}$ and for $1 \le i \le n - 1$, let $\mathcal{F}_i$ be the $\sigma$-algebra generated by $X^i =
(X_1, \ldots, X_i)$. Given any function $\varphi$ on $\mathcal{S}^n$, for $1 \le i \le n$, define the martingale
differences $V_i(\varphi)$ in the standard way, by (2.1).

The following theorem shows that when $\varphi$ is Lipschitz, these martingale
differences can be bounded in terms of the mixing coefficients defined in
Section 1.2. The proof of the theorem is given in Section 6.

Theorem 2.1. If $\mathcal{S}$ is a countable set and $\mathbb{P}$ is a probability measure
on $(\mathcal{S}^n, \mathcal{F})$ such that $\min_{i=1,\ldots,n} \inf_{y^i \in \mathcal{S}^i : \mathbb{P}(X^i = y^i) > 0} \mathbb{P}(X^i = y^i) > 0$, then for
any 1-Lipschitz function $\varphi$ on $\mathcal{S}^n$, we have for $1 \le i \le n$,

(2.3)    $\|V_i(\varphi)\|_\infty \le H_{n,i} = 1 + \sum_{j=i+1}^{n} \bar{\eta}_{ij},$

where $\{V_i(\varphi)\}$ are the martingale differences defined in (2.1) and the coefficients
$\bar{\eta}_{ij}$ and $H_{n,i}$ are as defined in (1.1) and (1.4), respectively.
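The bound (2.3) can be checked by brute force on a very small example. The sketch below enumerates all paths of a hypothetical homogeneous two-state Markov chain with $n = 3$, computes the martingale differences and the mixing coefficients directly from their definitions, and compares the two sides of (2.3) for $\varphi(x) = x_1 + x_2 + x_3$. All names and numerical values are assumptions made for illustration; this is a sanity check, not a proof.

```python
from itertools import product

S = (0, 1)
n = 3
p0 = {0: 0.5, 1: 0.5}                                  # hypothetical initial law
P = {0: {0: 0.75, 1: 0.25}, 1: {0: 0.25, 1: 0.75}}     # hypothetical kernel
PATHS = list(product(S, repeat=n))

def prob(x):
    """Probability of the full path x under the Markov measure."""
    pr = p0[x[0]]
    for a, b in zip(x, x[1:]):
        pr *= P[a][b]
    return pr

def cond_expect(phi, prefix):
    """E[phi(X) | X^i = prefix]; the empty prefix gives E[phi(X)]."""
    k = len(prefix)
    matching = [x for x in PATHS if x[:k] == prefix]
    return sum(phi(x) * prob(x) for x in matching) / sum(prob(x) for x in matching)

def max_martingale_difference(phi, i):
    """Maximum over prefixes z^i of |E[phi | X^i = z^i] - E[phi | X^{i-1} = z^{i-1}]|."""
    return max(abs(cond_expect(phi, z) - cond_expect(phi, z[:-1]))
               for z in product(S, repeat=i))

def cond_law_tail(prefix, j):
    """Law of (X_j, ..., X_n) given X^i = prefix (j is 1-based)."""
    k = len(prefix)
    matching = [x for x in PATHS if x[:k] == prefix]
    total = sum(prob(x) for x in matching)
    law = {}
    for x in matching:
        law[x[j - 1:]] = law.get(x[j - 1:], 0.0) + prob(x) / total
    return law

def eta_bar(i, j):
    """Mixing coefficient: worst-case TV distance between conditional tail laws."""
    best = 0.0
    for y in product(S, repeat=i - 1):
        for w in S:
            for w_hat in S:
                a = cond_law_tail(y + (w,), j)
                b = cond_law_tail(y + (w_hat,), j)
                tv = 0.5 * sum(abs(a.get(t, 0.0) - b.get(t, 0.0))
                               for t in set(a) | set(b))
                best = max(best, tv)
    return best

def H(i):
    """H_{n,i} = 1 + sum of eta_bar(i, j) over j > i."""
    return 1.0 + sum(eta_bar(i, j) for j in range(i + 1, n + 1))

def check_bound(phi):
    """Return (i, max |V_i|, H_{n,i}) for each i."""
    return [(i, max_martingale_difference(phi, i), H(i)) for i in range(1, n + 1)]
```

Calling `check_bound(sum)` returns one triple per index $i$; for this chain the left-hand sides are 0.875, 1.125 and 0.75, against bounds 1.75, 1.5 and 1.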
Remark 2.1. Since $\|V_i(\cdot)\|_\infty$ is homogeneous in the sense that $\|V_i(a\varphi)\|_\infty =
a\|V_i(\varphi)\|_\infty$ for $a > 0$, and $a^{-1}\varphi$ is 1-Lipschitz whenever $\varphi$ is $a$-Lipschitz, Theorem
2.1 implies that for $1 \le i \le n$,

$\|V_i(\varphi)\|_\infty \le cH_{n,i}$

for any $c$-Lipschitz $\varphi$. Along with the relation (1.3), this shows that

$\sum_{i=1}^{n} \|V_i(\varphi)\|_\infty^2 \le n \max_{1 \le i \le n} \|V_i(\varphi)\|_\infty^2 \le nc^2 \max_{1 \le i \le n} H_{n,i}^2 = nc^2\|\Delta_n\|_\infty^2.$

When combined with Azuma's inequality (2.2), this shows that Theorem 2.1
implies Theorem 1.1.

Remark 2.2. The quantity $V_i(\cdot)$ is translation-invariant in the sense
that $V_i(\varphi + a) = V_i(\varphi)$ for every $a \in \mathbb{R}$. Since the length of the range of any
1-Lipschitz function on $\mathcal{S}^n$ is at most $n$, the Hamming diameter of $\mathcal{S}^n$, for
the proof of Theorem 2.1 there is no loss of generality in assuming that the
range of $\varphi$ lies in $[0, n]$.

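3. A linear programming bound for martingale differences. In the next
three sections, we prove Theorem 2.1 under the assumption that $\mathcal{S}$ is finite.
We start by obtaining a slightly more tractable form for the martingale
difference.

Lemma 3.1. Given a probability measure $\mathbb{P}$ on $(\mathcal{S}^n, \mathcal{F})$ and any function
$\varphi : \mathcal{S}^n \to \mathbb{R}$, let the martingale differences $\{V_i(\varphi), 1 \le i \le n\}$ be defined as in
(2.1). Then, for $1 \le i \le n$,

$\|V_i(\varphi)\|_\infty \le \max_{\substack{y^{i-1} \in \mathcal{S}^{i-1},\ w, \hat{w} \in \mathcal{S} \\ \mathbb{P}(X^i = y^{i-1}w) > 0,\ \mathbb{P}(X^i = y^{i-1}\hat{w}) > 0}} |\hat{V}_i(\varphi; y^{i-1}, w, \hat{w})|,$

where, for $y^{i-1} \in \mathcal{S}^{i-1}$ and $w, \hat{w} \in \mathcal{S}$,

(3.1)    $\hat{V}_i(\varphi; y^{i-1}, w, \hat{w}) = \mathbb{E}[\varphi(X) \mid X^i = y^{i-1}w] - \mathbb{E}[\varphi(X) \mid X^i = y^{i-1}\hat{w}].$

Proof. Since $V_i(\varphi)$ is $\mathcal{F}_i$-measurable and $\mathcal{F}_i = \sigma(X^i)$, it follows immediately
that

(3.2)    $\|V_i(\varphi)\|_\infty = \max_{z^i \in \mathcal{S}^i} |V_i(\varphi; z^i)|,$

where for $1 \le i \le n$ and $z^i \in \mathcal{S}^i$, we define

(3.3)    $V_i(\varphi; z^i) = \mathbb{E}[\varphi(X) \mid X^i = z^i] - \mathbb{E}[\varphi(X) \mid X^{i-1} = z^{i-1}].$

Expanding $V_i(\varphi; z^i)$, we obtain

$V_i(\varphi; z^i) = \mathbb{E}[\varphi(X) \mid X^i = z^i] - \sum_{\hat{w} \in \mathcal{S}} \mathbb{E}[\varphi(X) \mid X^i = z^{i-1}\hat{w}]\, \mathbb{P}(z^{i-1}\hat{w} \mid z^{i-1})$

$= \sum_{\hat{w} \in \mathcal{S}} \mathbb{P}(z^{i-1}\hat{w} \mid z^{i-1}) \left( \mathbb{E}[\varphi(X) \mid X^i = z^i] - \mathbb{E}[\varphi(X) \mid X^i = z^{i-1}\hat{w}] \right)$

$= \sum_{\hat{w} \in \mathcal{S}} \mathbb{P}(z^{i-1}\hat{w} \mid z^{i-1})\, \hat{V}_i(\varphi; z^{i-1}, z_i, \hat{w}),$

where the second equality uses the fact that $\sum_{\hat{w} \in \mathcal{S}} \mathbb{P}(z^{i-1}\hat{w} \mid z^{i-1}) = 1$ with,
as usual, $\mathbb{P}(z^{i-1}\hat{w} \mid z^{i-1})$ representing $\mathbb{P}\{X^i = z^{i-1}\hat{w} \mid X^{i-1} = z^{i-1}\}$. In turn,
since $0 \le \mathbb{P}(z^{i-1}\hat{w} \mid z^{i-1}) \le 1$, the last display implies (via Jensen's inequality)
that for every $z^i \in \mathcal{S}^i$,

$|V_i(\varphi; z^i)| \le \sum_{\hat{w} \in \mathcal{S}} \mathbb{P}(z^{i-1}\hat{w} \mid z^{i-1})\, |\hat{V}_i(\varphi; z^{i-1}, z_i, \hat{w})| \le \max_{\hat{w} \in \mathcal{S}} |\hat{V}_i(\varphi; z^{i-1}, z_i, \hat{w})|.$

Taking the maximum of both sides over $z^i \in \mathcal{S}^i$ and invoking (3.2), the
desired inequality is obtained. □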

For $n \in \mathbb{N}$, define the finite-dimensional vector space

(3.4)    $K_n = \{\kappa : \mathcal{S}^n \to \mathbb{R}\},$

which becomes a Euclidean space when endowed with the inner product

(3.5)    $\langle \kappa, \lambda \rangle = \sum_{x \in \mathcal{S}^n} \kappa(x)\lambda(x).$

Also, let $K_0$ be the collection of scalars.

Now, note that for $y^{n-1} \in \mathcal{S}^{n-1}$ and $w, \hat{w} \in \mathcal{S}$,

$\hat{V}_n(\varphi; y^{n-1}, w, \hat{w}) = \varphi(y^{n-1}w) - \varphi(y^{n-1}\hat{w}),$

and thus for all 1-Lipschitz functions $\varphi$, the bound

(3.6)    $|\hat{V}_n(\varphi; y^{n-1}, w, \hat{w})| \le 1 = H_{n,n}$

holds immediately. On the other hand, given $1 \le i \le n - 1$, $y^{i-1} \in \mathcal{S}^{i-1}$ and
$w, \hat{w} \in \mathcal{S}$, the map $\varphi \mapsto \hat{V}_i(\varphi; y^{i-1}, w, \hat{w})$ defined in (3.1) is clearly a linear
functional on $K_n$. It therefore admits a representation as an inner product
with some element $\kappa = \kappa[y^{i-1}, w, \hat{w}] \in K_n$ (where the notation $[\cdot]$ is used to
emphasize the dependence of the function $\kappa \in K_n$ on $y^{i-1}$, $w$ and $\hat{w}$):

(3.7)    $\hat{V}_i(\varphi; y^{i-1}, w, \hat{w}) = \langle \kappa, \varphi \rangle.$

Indeed, expanding (3.1), it is easy to see that the precise form of $\kappa =
\kappa[y^{i-1}, w, \hat{w}]$ is given by

(3.8)    $\kappa(x) = 1_{\{x^i = y^{i-1}w\}} \mathbb{P}(x_{i+1}^n \mid y^{i-1}w) - 1_{\{x^i = y^{i-1}\hat{w}\}} \mathbb{P}(x_{i+1}^n \mid y^{i-1}\hat{w})$

for $x \in \mathcal{S}^n$.

Define $\Phi_n \subset K_n$ to be the subset of 1-Lipschitz functions (with respect
to the Hamming metric) with range in $[0, n]$. As observed in Remark 2.2, in
order to prove Theorem 2.1, it suffices to establish the martingale difference
bounds (2.3) just for $\varphi \in \Phi_n$. In light of Lemma 3.1 and the representation
(3.7), it is therefore natural to study the quantity

(3.9)    $\|\kappa\|_\Phi = \max_{\varphi \in \Phi_n} |\langle \kappa, \varphi \rangle|$

for $\kappa \in K_n$. The notation used reflects the fact that $\|\cdot\|_\Phi$ defines a norm on
$K_n$. Since we will not be appealing to any properties of norms, we relegate
the proof of this fact to Appendix A.1.

Remark 3.2. The title of this section is motivated by the fact that
the optimization problem $\max_{\varphi \in \Phi_n} \langle \kappa, \varphi \rangle$ is a linear program. Indeed, $f(\cdot) =
\langle \kappa, \cdot \rangle$ is a linear function and $\Phi_n$ is a finitely generated, compact, convex
polytope. We make no use of this simple fact in our proofs and therefore do
not prove it, but see Lemma 4.4 for a proof of a closely related claim.

4. A bound on the $\Phi$-norm. In Section 3 we motivated the introduction
of the norm $\|\cdot\|_\Phi$ on the space $K_n$. In this section we bound $\|\cdot\|_\Phi$ by
another, more tractable, norm, which we call the $\Psi$-norm. In Section 5 we
then bound the $\Psi$-norm in terms of the coefficients $H_{n,i}$.

For $n \in \mathbb{N}$, define the marginal projection operator $(\cdot)'$, which takes $\kappa \in K_n$
to $\kappa' \in K_{n-1}$ as follows: if $n > 1$, for each $y \in \mathcal{S}^{n-1}$,

(4.1)    $\kappa'(y) = \sum_{x_1 \in \mathcal{S}} \kappa(x_1 y);$

if $n = 1$, then $\kappa'$ is the scalar

$\kappa' = \sum_{x_1 \in \mathcal{S}} \kappa(x_1).$

We define the Positive-Summation-Iterated (Psi) functional $\Psi_n : K_n \to \mathbb{R}$
recursively using projections: $\Psi_0(\cdot) = 0$ and for $n \ge 1$,

(4.2)    $\Psi_n(\kappa) = \sum_{x \in \mathcal{S}^n} (\kappa(x))_+ + \Psi_{n-1}(\kappa'),$

where we recall that $(z)_+ = \max(z, 0)$ is the positive part of $z$. The norm
associated with $\Psi_n$ is then defined to be

(4.3)    $\|\kappa\|_\Psi = \max_{s \in \{-1, 1\}} \Psi_n(s\kappa).$
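The recursion defining the Psi functional translates almost line by line into code. In the sketch below an element of $K_n$ is represented as a Python dict mapping $n$-tuples over $\mathcal{S}$ to real numbers; this representation, the function names and the sample function `kappa` are our own illustrative assumptions.

```python
def marginal_projection(kappa):
    """Marginal projection: sum out the first coordinate of each argument tuple."""
    out = {}
    for x, value in kappa.items():
        out[x[1:]] = out.get(x[1:], 0.0) + value
    return out

def psi(kappa):
    """Psi functional: positive parts summed, plus Psi of the marginal projection."""
    n = len(next(iter(kappa)))
    if n == 0:
        return 0.0  # Psi_0 is identically zero on scalars
    positive_sum = sum(max(value, 0.0) for value in kappa.values())
    return positive_sum + psi(marginal_projection(kappa))

def psi_norm(kappa):
    """Psi-norm: the larger of Psi(kappa) and Psi(-kappa)."""
    negated = {x: -value for x, value in kappa.items()}
    return max(psi(kappa), psi(negated))

# Hypothetical element of K_2 over S = {0, 1}.
kappa = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): -0.5, (1, 1): -1.0}
```

For this `kappa`, the positive parts sum to 1.5, the projection is $\kappa'(0) = 0.5$, $\kappa'(1) = -0.5$ with $\Psi_1(\kappa') = 0.5$, so $\Psi_2(\kappa) = 2$; the same value is obtained for $-\kappa$.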






L. (A.) KONTOROVICH AND K. RAMANAN 



As in the case of $\|\cdot\|_\Phi$, it is easily verified that $\|\cdot\|_\Psi$ is a valid norm on $K_n$
(see Lemma A.1).

The next theorem is the main result of this section.

Theorem 4.1. For all $n \in \mathbb{N}$ and $\kappa \in K_n$,

$\|\kappa\|_\Phi \le \|\kappa\|_\Psi.$

The remainder of the section is devoted to proving this theorem. See
Kontorovich's Ph.D. thesis [15] for a considerably simpler proof of a more
general claim (which covers the weighted Hamming metrics).

Remark 4.2. We will assume for simplicity that $z \ne 0$ whenever expressions
of the form $\mathrm{sgn}(z)$ or $1_{\{z > 0\}}$ are encountered below. This incurs
no loss of generality [the inequalities proved for this special case will hold
in general by continuity of $(\cdot)_+$] and affords us a slightly cleaner exposition,
obviating the need to check the $z = 0$ case.

First, we need to introduce a bit more notation. For $n \in \mathbb{N}$ and $y \in \mathcal{S}$,
define the $y$-section operator $(\cdot)_y : K_n \to K_{n-1}$ that takes $\kappa$ to $\kappa_y$ by

(4.4)    $\kappa_y(x) = \kappa(xy)$

for $x \in \mathcal{S}^{n-1}$. By convention, for $n = 1$ and $x \in \mathcal{S}^0$, $\kappa_y(x)$ is equal to the
scalar $\kappa_y = \kappa(y)$.

Note that for any $y \in \mathcal{S}$, the marginal projection and $y$-section operators
commute; in other words, for $\kappa \in K_{n+2}$, we have $(\kappa')_y = (\kappa_y)' \in K_n$ and so
we can denote this common value simply by $\kappa'_y \in K_n$: for each $z \in \mathcal{S}^n$,

(4.5)    $\kappa'_y(z) = \sum_{x_1 \in \mathcal{S}} \kappa_y(x_1 z) = \sum_{x_1 \in \mathcal{S}} \kappa(x_1 z y).$

Moreover, summing both sides of the first equality in (4.5) over $z \in \mathcal{S}^n$, we
obtain

(4.6)    $\sum_{z \in \mathcal{S}^n} \kappa'_y(z) = \sum_{z \in \mathcal{S}^n} \sum_{x_1 \in \mathcal{S}} \kappa_y(x_1 z) = \sum_{x \in \mathcal{S}^{n+1}} \kappa_y(x).$
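We can use $y$-sections to recast the $\Psi_n(\cdot)$ functional in an alternative
form:

Lemma 4.3. For all $n \ge 0$ and $\kappa \in K_{n+1}$, we have

(4.7)    $\Psi_{n+1}(\kappa) = \sum_{y \in \mathcal{S}} \left[ \Psi_n(\kappa_y) + \left( \sum_{x \in \mathcal{S}^n} \kappa_y(x) \right)_+ \right].$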
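The identity in Lemma 4.3 lends itself to a quick numerical sanity check. The sketch below compares the recursive definition of the Psi functional with the section-based expression on a randomly generated integer-valued function; the dict representation of functions on tuples, the helper names, the alphabet and the random seed are all assumptions made for illustration.

```python
import random
from itertools import product

def marginal_projection(kappa):
    """Sum out the first coordinate of each argument tuple."""
    out = {}
    for x, value in kappa.items():
        out[x[1:]] = out.get(x[1:], 0) + value
    return out

def psi(kappa):
    """Recursive Psi functional on a dict from n-tuples to numbers."""
    if len(next(iter(kappa))) == 0:
        return 0
    return sum(max(v, 0) for v in kappa.values()) + psi(marginal_projection(kappa))

def y_section(kappa, y):
    """y-section: fix the last coordinate to y and drop it."""
    return {x[:-1]: v for x, v in kappa.items() if x[-1] == y}

def section_form(kappa, S):
    """Sum over y of Psi(kappa_y) plus the positive part of the total mass of kappa_y."""
    total = 0
    for y in S:
        section = y_section(kappa, y)
        total += psi(section) + max(sum(section.values()), 0)
    return total

S = (0, 1, 2)
rng = random.Random(0)
kappa = {x: rng.randint(-5, 5) for x in product(S, repeat=3)}
```

Integer values are used so that the two sides can be compared exactly, without floating-point tolerance.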
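Proof. Suppose $n = 0$. Then for $\kappa \in K_1$, $\kappa_y \in K_0$ is the scalar $\kappa(y)$,
$\Psi_0(\kappa_y) = 0$ by definition and $\mathcal{S}^0$ consists of just one element (the null string).
Thus the r.h.s. of (4.7) becomes $\sum_{y \in \mathcal{S}} (\kappa(y))_+$, which by definition [see (4.2)]
is equal to $\Psi_1(\kappa)$. So the claim holds for $n = 0$.

Now suppose (4.7) holds for $n = 0, \ldots, N$ for some $N \ge 0$. In order to
prove the claim for $n = N + 1$, we pick any $\lambda \in K_{N+2}$ and observe that

(4.8)    $\sum_{y \in \mathcal{S}} \left[ \Psi_{N+1}(\lambda_y) + \left( \sum_{x \in \mathcal{S}^{N+1}} \lambda_y(x) \right)_+ \right]$

$= \sum_{y \in \mathcal{S}} \left[ \sum_{x \in \mathcal{S}^{N+1}} (\lambda_y(x))_+ + \Psi_N(\lambda'_y) + \left( \sum_{x \in \mathcal{S}^{N+1}} \lambda_y(x) \right)_+ \right]$

$= \sum_{z \in \mathcal{S}^{N+2}} (\lambda(z))_+ + \sum_{y \in \mathcal{S}} \left[ \Psi_N(\lambda'_y) + \left( \sum_{u \in \mathcal{S}^N} \lambda'_y(u) \right)_+ \right],$

where the first equality results from the definition of $\Psi_{N+1}$ in (4.2) and the
second equality uses the trivial identity

$\sum_{y \in \mathcal{S}} \sum_{x \in \mathcal{S}^{N+1}} (\lambda_y(x))_+ = \sum_{z \in \mathcal{S}^{N+2}} (\lambda(z))_+$

and the relation (4.6) (with $\kappa$ replaced by $\lambda$ and $n$ by $N$).
On the other hand, by the definition given in (4.2) we have

(4.9)    $\Psi_{N+2}(\lambda) = \sum_{z \in \mathcal{S}^{N+2}} (\lambda(z))_+ + \Psi_{N+1}(\lambda').$

To compare the r.h.s. of (4.8) with the r.h.s. of (4.9), note that the term
$\sum_{z \in \mathcal{S}^{N+2}} (\lambda(z))_+$ is common to both sides and, since (4.7) is satisfied with
$n = N$ by the inductive hypothesis, the remaining two terms are also equal:

$\Psi_{N+1}(\lambda') = \sum_{y \in \mathcal{S}} \left[ \Psi_N(\lambda'_y) + \left( \sum_{u \in \mathcal{S}^N} \lambda'_y(u) \right)_+ \right].$

This establishes (4.7) for $n = N + 1$ and the lemma follows by induction. □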



We will need one more definition to facilitate the main proof. Fix a function
$\kappa \in K_n$ and consider some properties that any other function $\alpha \in K_n$
might have relative to $\kappa$:

(SL1) for all $x \in \mathcal{S}^n$,

$0 \le \alpha(x) \le n - 1_{\{\kappa(x) > 0\}};$

(SL1') for all $x \in \mathcal{S}^n$,

$0 \le \alpha(x) \le n - 1_{\{\kappa(x) > 0\}} + 1;$

(SL2) for all $x, y \in \mathcal{S}^n$,

$\mathrm{sgn}\,\kappa(x) = \mathrm{sgn}\,\kappa(y) \implies |\alpha(x) - \alpha(y)| \le d(x, y);$

(SL3) for all $x, y \in \mathcal{S}^n$ with $d(x, y) = 1$,

$\mathrm{sgn}\,\kappa(x) > \mathrm{sgn}\,\kappa(y) \implies \alpha(x) \le \alpha(y) \le \alpha(x) + 2.$

We define $A_n(\kappa)$ to be the set of $\alpha \in K_n$ that satisfy (SL1), (SL2) and
(SL3). Similarly, we write $B_n(\kappa)$ to denote the set of all $\beta \in K_n$ that satisfy
(SL1'), (SL2) and (SL3). (A possible descriptive name for these objects is
"sub-Lipschitz polytopes"; hence the letters SL.)
The following is almost an immediate observation.

Lemma 4.4. For $n \in \mathbb{N}$ and $\kappa \in K_n$, the following two properties hold:

(a) $A_n(\kappa)$ and $B_n(\kappa)$ are compact, convex polytopes in $[0, n+1]^{\mathcal{S}^n}$;

(b) for all $y \in \mathcal{S}$,

(4.10)    $\alpha \in A_n(\kappa) \implies \alpha_y \in B_{n-1}(\kappa_y).$

Proof. Property (a) is verified by checking that each of (SL1), (SL1'),
(SL2) and (SL3) is closed under convex combinations. To verify property (b),
fix $y \in \mathcal{S}$ and choose $\alpha \in A_n(\kappa)$. Using the definition $\alpha_y(x) = \alpha(xy)$ and the
fact that $d(x, z) = d(xy, zy)$ for $x, z \in \mathcal{S}^{n-1}$, it is straightforward to check
that the fact that $\alpha$ satisfies (SL2) [resp., (SL3)] relative to $\kappa$ implies that
$\alpha_y$ also satisfies the same property relative to $\kappa_y$. Moreover, since

$n - 1_{\{\kappa(xy) > 0\}} = (n - 1) - 1_{\{\kappa_y(x) > 0\}} + 1,$

the fact that $\alpha$ satisfies (SL1) relative to $\kappa$ implies that $\alpha_y$ satisfies (SL1')
relative to $\kappa_y$. This proves property (b) and hence the lemma. □

We will also need the following simple fact about $B_n(\kappa)$ [which, in fact,
also holds for $A_n(\kappa)$].

Lemma 4.5. For $n \in \mathbb{N}$ and any $\kappa \in K_n$, if $\beta$ is an extreme point of
$B_n(\kappa)$, then $\beta(x)$ is an integer between 0 and $n + 1$ for every $x \in \mathcal{S}^n$ [in
other words, $\beta \in B_n(\kappa) \cap \mathbb{Z}_+^{\mathcal{S}^n}$].

Proof. Fix $n \in \mathbb{N}$ and $\kappa \in K_n$. We will establish the lemma using an
argument by contradiction. Suppose that $B_n(\kappa)$ has an extreme point $\beta$
that takes on a noninteger value for some $x \in \mathcal{S}^n$. Let $E \subset \mathcal{S}^n$ be the set of
elements of $\mathcal{S}^n$ on which $\beta$ is not an integer:

$E = \{x \in \mathcal{S}^n : \beta(x) \notin \mathbb{N}\}.$

Define $\beta_\varepsilon^+$ by

(4.11)    $\beta_\varepsilon^+(x) = \beta(x) + \varepsilon 1_{\{x \in E\}},$

and similarly

(4.12)    $\beta_\varepsilon^-(x) = \beta(x) - \varepsilon 1_{\{x \in E\}},$

where $\varepsilon \in (0, 1/2)$ is chosen small enough to satisfy

(4.13)    $\lfloor \beta(x) \rfloor \le \tilde{\beta}(x) \le \lceil \beta(x) \rceil,$

for $\tilde{\beta} = \beta_\varepsilon^-$ and $\tilde{\beta} = \beta_\varepsilon^+$ and

(4.14)    $\beta(x) < \beta(y) \implies \beta(x) + \varepsilon < \beta(y)$

for all $x, y \in \mathcal{S}^n$ (such a choice of $\varepsilon$ is feasible because $\mathcal{S}^n$ is a finite set).

We claim that $\beta_\varepsilon^+ \in B_n(\kappa)$. If $x \notin E$, then $\beta_\varepsilon^+(x) = \beta_\varepsilon^-(x) = \beta(x)$. On the
other hand, if $x \in E$, then (SL1') must hold with strict inequalities:

$0 < \beta(x) < n - 1_{\{\kappa(x) > 0\}} + 1.$

Together with the condition (4.13), this ensures that $\beta_\varepsilon^+$ and $\beta_\varepsilon^-$ also satisfy
(SL1'). Similarly, the relation (4.14) can be used to infer that $\beta_\varepsilon^+$ and $\beta_\varepsilon^-$ satisfy
(SL3). It only remains to verify (SL2). First observe that $\beta_\varepsilon^+(x) - \beta_\varepsilon^+(y) =
\beta(x) - \beta(y)$ whenever $\{x, y\} \subset E$ or $\{x, y\} \subset E^c$. Thus to prove (SL2), by symmetry
it suffices to only consider $x, y \in \mathcal{S}^n$ such that $\mathrm{sgn}\,\kappa(x) = \mathrm{sgn}\,\kappa(y)$,
$x \in E$ and $y \notin E$. In this case, we have $\beta(x) \ne \beta(y)$ and

(4.15)    $\beta_\varepsilon^+(x) - \beta_\varepsilon^+(y) = \beta(x) - \beta(y) + \varepsilon.$

If $\beta(x) < \beta(y)$, then (4.14) and the fact that $\beta$ satisfies (SL2) show that

$-d(x, y) < \varepsilon - d(x, y) \le \beta(x) - \beta(y) + \varepsilon < 0.$

The last two displays, together, show that then $\beta_\varepsilon^+$ satisfies (SL2). On the
other hand, suppose $\beta(x) > \beta(y)$. Since $d(\cdot, \cdot)$ is the Hamming metric and
$x \ne y$, $d(x, y)$ is an integer greater than or equal to 1. Moreover, $\beta(y)$ is also
an integer since $y \notin E$. Together with the fact that $\beta$ satisfies (SL2) this
implies that $\beta(x) < \beta(y) + d(x, y)$. Therefore, since $\mathcal{S}$ is finite, by choosing
$\varepsilon > 0$ smaller if necessary, one can assume that $\beta_\varepsilon^+(x) < \beta_\varepsilon^+(y) + d(x, y)$.
When combined with the elementary inequality $\beta_\varepsilon^+(x) - \beta_\varepsilon^+(y) = \beta(x) +
\varepsilon - \beta(y) > 0$, this proves that $\beta_\varepsilon^+$ satisfies (SL2). The argument for $\beta_\varepsilon^-$ is
analogous and thus omitted.

However, having proved that $\beta_\varepsilon^+, \beta_\varepsilon^- \in B_n(\kappa)$, we now see that $\beta = \tfrac{1}{2}\beta_\varepsilon^+ +
\tfrac{1}{2}\beta_\varepsilon^-$ is a strict convex combination of two elements of $B_n(\kappa)$, which contradicts
the fact that it is an extreme point of this set. □

Let us observe that $A_n(\kappa) \subset B_n(\kappa)$. More importantly, we will utilize the
structural relationship between $A_n$ and $B_n$ stated in the next lemma.

Lemma 4.6. Let k G K n be given. For any j3 G B n {n) there is an a G 
A n (n) such that 



(4.16) (k,/3)< V k(x) + <«,«). 




Proof. Fix k G -ftT n throughout the proof. Since we are trying to bound 
for (3 G B u (k) and linear functions achieve their maxima on the ex- 
treme points of convex sets, and A n (K) is convex, it suffices to establish 

(4.16) only for (3 that are extreme points of B n (n). By Lemma 4.5, this im- 
plies that [3 G B n (K) n ZtjL . If, in addition, f3 G A n (n), the statement of the 
lemma is trivial; so we need only consider 

(4.17) 0e(B n ( K )\A n (K))nzl n . 

For $\beta$ satisfying (4.17), define $\alpha$ by

(4.18)  $\alpha(x) = (\beta(x) - 1)_+$  for $x \in S^n$.

We first claim that $\alpha \in A_n(\kappa)$. To see why this is true, first observe that
the fact that $\beta$ satisfies (SL1') immediately implies that $\alpha$ satisfies (SL1).
Moreover, for $x, y \in S^n$, if $\beta(x) < 1$, $\beta(y) < 1$, then $\alpha(x) - \alpha(y) = 0$; if
$\beta(x) \ge 1$, $\beta(y) \ge 1$, then $\alpha(x) - \alpha(y) = \beta(x) - \beta(y)$; if $\beta(x) < 1$, $\beta(y) \ge 1$, then

$0 \ge \alpha(x) - \alpha(y) = -\beta(y) + 1 \ge \beta(x) - \beta(y),$

with an analogous relation holding if $\beta(x) \ge 1$, $\beta(y) < 1$.

Combining the above relations with the fact that $\beta$ satisfies properties
(SL2) and (SL3), it is straightforward to see that $\alpha$ also satisfies the same
properties. Having shown that $\alpha \in A_n(\kappa)$, we will now show that $\alpha$ satisfies
(4.16). Proving (4.16) is equivalent to showing that

(4.19)  $\langle\kappa,\delta\rangle = \sum_{x\in S^n}\kappa(x)\delta(x) \le \Big(\sum_{x\in S^n}\kappa(x)\Big)_+,$

where $\delta(x) = \beta(x) - \alpha(x) = 1_{\{\beta(x)\ge 1\}}$. To this end, we claim (and justify
below) that for $\beta$ satisfying (4.17) and for $\delta$ defined as above, we have for
all $z \in S^n$

(4.20)  $\kappa(z) < 0 \;\Longrightarrow\; \delta(z) = 1;$

the relation (4.19) then easily follows from (4.20).

Suppose, to get a contradiction, that (4.20) fails; this means that there
exists a $z \in S^n$ such that

(4.21)  $\kappa(z) < 0$,  $\beta(z) = 0$.

Recall that by assumption, $\beta \notin A_n(\kappa)$; since $\beta \in B_n(\kappa)$, this can only happen
if $\beta$ violates (SL1), which can occur in one of only two ways:

(i) $\beta(y) = n + 1$ for some $y \in S^n$ with $\kappa(y) < 0$;

(ii) $\beta(y) = n$ for some $y \in S^n$ with $\kappa(y) > 0$.

We can use (4.21) to rule out the occurrence of (i) right away. Indeed, in
this case (4.21) and (i) imply that

$\beta(y) - \beta(z) = n + 1 - 0 > n \ge d(y, z).$

Since $\operatorname{sgn}\kappa(y) = \operatorname{sgn}\kappa(z) = -1$, this means $\beta$ does not satisfy (SL2), which
leads to a contradiction.

On the other hand, suppose (ii) holds. Let $x^{(0)} = z$ [where $z$ satisfies
(4.21)], $x^{(r)} = y$ and let $\{x^{(i)}\}_{i=0}^{r} \subset S^n$ be such that $d(x^{(i)}, x^{(i+1)}) = 1$. Note
that we can always choose $r \le n$ because the diameter of $S^n$ is no greater
than $n$. Let $f(i) = \beta(x^{(i)})$; thus $f(0) = 0$ and $f(r) = n$. We say that a "sign
change" occurs on the $i$th step if $\operatorname{sgn}\kappa(x^{(i)}) \ne \operatorname{sgn}\kappa(x^{(i+1)})$; we call this sign
change positive if $\operatorname{sgn}\kappa(x^{(i)}) < \operatorname{sgn}\kappa(x^{(i+1)})$ and negative otherwise. Since
$\kappa(x^{(0)}) < 0$ and $\kappa(x^{(r)}) > 0$ there must be an odd number of sign changes;
furthermore, the number of positive sign changes exceeds the number of
negative ones by 1. The fact that $\beta$ satisfies property (SL3) implies that $f$
cannot increase on a positive sign change and can increase by at most 2
on a negative sign change. Moreover, since $\beta$ also satisfies (SL2), we know
that the value of $f$ can change by at most 1 when there is no sign change.
This means that after $r$ steps, $f$ can increase by at most $r - 1$, contradicting
the fact that $f(0) = 0$ and $f(r) = n$, as implied by (ii), and $r \le n$. This
establishes the claim (4.20) and hence completes the proof of the lemma.
□ 

We need one more lemma before we can state the main result. 

Lemma 4.7. For $n \in \mathbb{N}$, for every $\kappa \in K_n$ and every $\alpha \in A_n(\kappa)$, we have

(4.22)  $\langle\kappa,\alpha\rangle \le \Psi_{n-1}(\kappa').$

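Proof. Fix $n \in \mathbb{N}$ and $\kappa \in K_n$. Then for any $\alpha \in A_n(\kappa)$, we prove the
relation (4.22) by induction. For $n = 1$, property (SL1) dictates that
$0 \le \alpha(x) \le 1_{\{\kappa(x)<0\}}$, and so

$\langle\kappa,\alpha\rangle = \sum_{x\in S:\,\kappa(x)<0}\kappa(x)\alpha(x) \le 0 = \Psi_0(\kappa').$

[Recall that by definition $\Psi_0(\cdot) = 0$.] Now suppose that, for some $j \ge 1$ and
$n = j$, (4.22) holds for all $\kappa \in K_n$ and all $\alpha \in A_n(\kappa)$. Pick any $\lambda \in K_{j+1}$ and
any $\alpha \in A_{j+1}(\lambda)$ and decompose

(4.23)  $\langle\lambda,\alpha\rangle = \sum_{y\in S}\langle\lambda_y,\alpha_y\rangle.$

By property (b) of Lemma 4.4, for all $y \in S$, $\alpha \in A_{j+1}(\lambda)$ implies
$\alpha_y \in B_j(\lambda_y)$. Along with Lemma 4.6, this ensures the existence of $\gamma^{(y)} \in A_j(\lambda_y)$
such that

$\langle\lambda_y,\alpha_y\rangle \le \Big(\sum_{x\in S^j}\lambda_y(x)\Big)_+ + \langle\lambda_y,\gamma^{(y)}\rangle.$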
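Applying the induction hypothesis to $\gamma^{(y)} \in A_j(\lambda_y)$ in the last display and
using the trivial identity

$\sum_{x\in S^j}\lambda_y(x) = \sum_{u\in S^{j-1}}\lambda_y'(u),$

we see that

$\langle\lambda_y,\alpha_y\rangle \le \Big(\sum_{u\in S^{j-1}}\lambda_y'(u)\Big)_+ + \Psi_{j-1}(\lambda_y').$

Together with (4.23) and Lemma 4.3, the last display yields the inequality

$\langle\lambda,\alpha\rangle \le \sum_{y\in S}\Big[\Big(\sum_{u\in S^{j-1}}\lambda_y'(u)\Big)_+ + \Psi_{j-1}(\lambda_y')\Big] \le \Psi_j(\lambda'),$

which proves that (4.22) holds for $n = j + 1$. The lemma then follows by
induction. □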

We now state the main result of this section. 

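Theorem 4.8. For $n \in \mathbb{N}$, and every $\kappa \in K_n$ and every $\varphi \in \Phi_n$, we have

$\langle\kappa,\varphi\rangle \le \Psi_n(\kappa).$

Proof. Fix $n \ge 1$, $\kappa \in K_n$ and $\varphi \in \Phi_n$. Define the function $\tilde\varphi$ on $S^n$ by

$\tilde\varphi(x) = (\varphi(x) - 1_{\{\kappa(x)>0\}})_+$

for $x \in S^n$. Then

$\langle\kappa,\varphi\rangle \le \sum_{x\in S^n}(\kappa(x))_+ + \langle\kappa,\tilde\varphi\rangle$

holds, since for any $k \in \mathbb{R}$ and $f \in \mathbb{R}_+$,

$kf \le (k)_+ + k(f - 1_{\{k>0\}})_+.$

In addition, $\tilde\varphi$ is easily seen to be in $A_n(\kappa)$. An application of Lemma 4.7 then
shows that $\langle\kappa,\tilde\varphi\rangle \le \Psi_{n-1}(\kappa')$. Since by definition,
$\Psi_n(\kappa) = \sum_{x\in S^n}(\kappa(x))_+ + \Psi_{n-1}(\kappa')$, we are done. □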

Remark 4.9. Recalling the definitions (3.9) and (4.3) of the $\|\cdot\|_\Phi$ and
$\|\cdot\|_\Psi$ norms, respectively, and noting that $\Psi_n(\kappa) \le \|\kappa\|_\Psi$ for all $\kappa \in K_n$, it
is clear that Theorem 4.1 is an immediate consequence of Theorem 4.8.
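As a small numerical illustration of Theorem 4.8, the following Python sketch evaluates the recursion $\Psi_n(\kappa) = \sum_x (\kappa(x))_+ + \Psi_{n-1}(\kappa')$ and compares it with a brute-force maximum of $\langle\kappa,\varphi\rangle$. The helper names (`hamming`, `psi`, `max_inner_product`) and the test function `kappa` are illustrative choices, not part of the paper. The brute force ranges only over integer-valued 1-Lipschitz functions with values in $\{0,\dots,n\}$, so this is a sanity check on a tiny example rather than a verification of the theorem.

```python
from itertools import product

def hamming(x, y):
    """Hamming distance between two equal-length tuples."""
    return sum(a != b for a, b in zip(x, y))

def psi(kappa, n):
    """Psi_n(kappa) = sum of positive parts + Psi_{n-1}(kappa'), with Psi_0 = 0."""
    if n == 0:
        return 0.0
    positive_part = sum(max(v, 0.0) for v in kappa.values())
    marginal = {}
    for x, v in kappa.items():
        # kappa'(y) sums kappa(x_1 y) over the first coordinate x_1
        marginal[x[1:]] = marginal.get(x[1:], 0.0) + v
    return positive_part + psi(marginal, n - 1)

def max_inner_product(kappa, states, n):
    """Max of <kappa, phi> over integer-valued 1-Lipschitz phi: S^n -> {0, ..., n}."""
    points = list(product(states, repeat=n))
    best = float("-inf")
    for values in product(range(n + 1), repeat=len(points)):
        phi = dict(zip(points, values))
        lipschitz = all(abs(phi[x] - phi[y]) <= hamming(x, y)
                        for x in points for y in points)
        if lipschitz:
            best = max(best, sum(kappa[x] * phi[x] for x in points))
    return best

# Hypothetical test function on S^2 with S = {0, 1}
states = (0, 1)
kappa = {(0, 0): 0.5, (0, 1): -0.25, (1, 0): -0.5, (1, 1): 0.25}
bound = psi(kappa, 2)
brute = max_inner_product(kappa, states, 2)
print(brute, "<=", bound)
```

The enumeration grows like $(n+1)^{|S|^n}$, so it is only practical for very small $n$ and $|S|$.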

5. The martingale bound for finite S. In Section 3 (specifically, Lemma 3.1
and relation (3.7)) we showed that if $S$ is finite, then given any probability
measure $P$ on $(S^n, \mathcal{F})$ and a 1-Lipschitz function $\varphi$, for $1 \le i \le n - 1$,

(5.1)  $\|V_i(\varphi)\|_\infty \le \max_{y^{i-1}\in S^{i-1},\; w,\hat w\in S} |\langle\kappa[y^{i-1}, w, \hat w], \varphi\rangle|,$

where $V_i(\varphi)$ are the martingale differences defined in (2.1) and the function
$\kappa = \kappa[y^{i-1}, w, \hat w] \in K_n$ is given explicitly, as in (3.8), by

(5.2)  $\kappa(x) = 1_{\{x^{i-1}=y^{i-1}\}}\big(1_{\{x_i=w\}}P(x_{i+1}^n \mid y^{i-1}w) - 1_{\{x_i=\hat w\}}P(x_{i+1}^n \mid y^{i-1}\hat w)\big).$

The crux of the proof of Theorem 2.1 for finite S is the following result, 
which is proved using Theorem 4.1 of Section 4. 

Theorem 5.1. Suppose $S$ is finite and $P$ is a probability measure defined
on $(S^n, \mathcal{F})$. Moreover, given any $1 \le i \le n - 1$, $y^{i-1} \in S^{i-1}$ and $w, \hat w \in S$,
let the function $\kappa[y^{i-1}, w, \hat w] \in K_n$ be defined by (5.2) and the coefficients
$\eta_{ij}(y^{i-1}, w, \hat w)$ and $H_{n,i}$ be defined by (1.2) and (1.4), respectively. Then for
any function $\varphi \in \Phi_n$,

(5.3)  $|\langle\kappa[y^{i-1}, w, \hat w], \varphi\rangle| \le 1 + \sum_{j=i+1}^{n}\eta_{ij}(y^{i-1}, w, \hat w) \le H_{n,i}.$

Proof. The second inequality is a direct consequence of the definition
of the coefficients $H_{n,i}$. In order to prove the first inequality, first fix $n \in \mathbb{N}$,
the measure $P$ on $(S^n, \mathcal{F})$, $1 \le i \le n - 1$, $y^{i-1} \in S^{i-1}$ and $w, \hat w \in S$, and set
$\kappa = \kappa[y^{i-1}, w, \hat w]$. Then let $L = n - i + 1$ and for $z \in S^{i-1}$, define the operator
$T_z : K_n \to K_L$ as follows: for $\lambda \in K_n$ and $x \in S^L$,

$(T_z\lambda)(x) = \lambda(zx).$

Given $\varphi \in \Phi_n$, note that $T_{y^{i-1}}\varphi \in \Phi_L$ and, due to the structure of
$\kappa = \kappa[y^{i-1}, w, \hat w]$ given in (5.2), the relation

$\langle\kappa,\varphi\rangle = \langle T_{y^{i-1}}\kappa, T_{y^{i-1}}\varphi\rangle$

holds. Combining this with Theorem 4.1 and definitions (3.9) and (4.3) of
the norms $\|\cdot\|_\Phi$ and $\|\cdot\|_\Psi$, we observe that

$|\langle\kappa,\varphi\rangle| \le \|T_{y^{i-1}}\kappa\|_\Phi \le \|T_{y^{i-1}}\kappa\|_\Psi = \max_{s\in\{-1,1\}}\Psi_L(sT_{y^{i-1}}\kappa).$

Thus in order to prove the theorem it suffices to show that for $s \in \{-1, 1\}$,

(5.4)  $\Psi_L(s\kappa^{(L)}) \le 1 + \sum_{j=i+1}^{n}\eta_{ij}(y^{i-1}, w, \hat w),$

where $\kappa^{(L)} = T_{y^{i-1}}\kappa$.

Now for $\ell = L, L - 1, \dots, 2$, define

$\kappa^{(\ell-1)} = (\kappa^{(\ell)})',$

where $(') : K_n \to K_{n-1}$ is the marginal projection operator defined in (4.1).
Then $\kappa^{(\ell)} \in K_\ell$ and a direct calculation shows that for $i < j \le n$ and $x \in S^{n-j+1}$,

(5.5)  $\kappa^{(n-j+1)}(x) = P\{X_j^n = x \mid X^i = y^{i-1}w\} - P\{X_j^n = x \mid X^i = y^{i-1}\hat w\}.$

Since $\kappa^{(n-j+1)}$ is a difference of two probability measures on $S^{n-j+1}$, we
have by (1.18) that

$\tfrac12\|\kappa^{(n-j+1)}\|_1 = \sum_{x\in S^{n-j+1}}(\kappa^{(n-j+1)}(x))_+.$

Together with (5.5), this immediately shows that for $i < j \le n$,

$\sum_{x\in S^{n-j+1}}(\kappa^{(n-j+1)}(x))_+ = \eta_{ij}(y^{i-1}, w, \hat w).$

Now, from the definition of the $\Psi_n$ functional (4.2), we see that

$\Psi_L(\kappa^{(L)}) = \sum_{x\in S^L}(\kappa^{(L)}(x))_+ + \Psi_{L-1}(\kappa^{(L-1)})$
$\qquad = \sum_{x\in S^L}(\kappa^{(L)}(x))_+ + \sum_{j=i+1}^{n}\sum_{x\in S^{n-j+1}}(\kappa^{(n-j+1)}(x))_+.$

It follows trivially that $\sum_{x\in S^L}(\kappa^{(L)}(x))_+ \le 1$. Together, the last three
statements show that (5.4) holds when $s = 1$. The inequality (5.4) with $s = -1$
can be established analogously, and hence the proof of the theorem is
complete. □

Proof of Theorem 2.1 for finite S. Given a finite set $S$, choose
any probability measure $P$ on $S^n$. By Remark 2.2, in order to prove Theorem
2.1 it suffices to establish the bound $\|V_i(\varphi)\|_\infty \le H_{n,i}$ for $1 \le i \le n$ only for
functions $\varphi \in \Phi_n$. When $i = n$, the bound follows from (3.6) and Lemma 3.1,
while for $1 \le i \le n - 1$, it can be obtained by taking the maximum of the
left-hand side of (5.3) over $y^{i-1} \in S^{i-1}$, $w, \hat w \in S$ and combining the resulting
inequality with (5.1). □

6. Extension of the martingale bound to countable S. In this section 
we use an approximation argument to extend the proof of Theorem 2.1 
from finite S to countable S. The key to the approximation is the following 
lemma. 

Lemma 6.1. Let $S$ be a countable space and for some $n \in \mathbb{N}$, let $\varphi$ be
a 1-Lipschitz function on $S^n$. Let $P$ be a probability measure defined on
$(S^n, \mathcal{F})$ such that $\min_{1\le i\le n}\inf_{y^i\in S^i:\,P(X^i=y^i)>0} P(X^i = y^i) > 0$, and let $V_i(\varphi)$ and
$\bar\eta_{ij}$ be as defined in (2.1) and (1.1), respectively. If there exists a sequence
of probability measures $\{P^{(m)}, m \in \mathbb{N}\}$ such that

(6.1)  $\lim_{m\to\infty}\|P - P^{(m)}\|_{TV} = 0,$

then

(6.2)  $\lim_{m\to\infty}\bar\eta_{ij}^{(m)} = \bar\eta_{ij}$  and  $\lim_{m\to\infty}\|V_i^{(m)}(\varphi)\|_\infty = \|V_i(\varphi)\|_\infty,$

where, for $m \in \mathbb{N}$, $\{V_i^{(m)}(\varphi)\}$ and $\{\bar\eta_{ij}^{(m)}\}$ are the martingale differences and
mixing coefficients associated with $P^{(m)}$, defined in the obvious manner.

Proof. The convergence (6.1) automatically implies the convergence in
total variation of the conditional distributions $P^{(m)}(\cdot \mid A)$ to $P(\cdot \mid A)$ for any
$A \subseteq S^n$ with $P(A) > 0$ (in fact the convergence is uniform with respect to
such $A$ under the stipulated condition). As an immediate consequence, we
see that $\bar\eta_{ij}^{(m)} \to \bar\eta_{ij}$ as $m \to \infty$, and (since $\varphi$ is bounded) that
$\|V_i^{(m)}(\varphi) - V_i(\varphi)\|_\infty \to 0$, which implies the convergence in (6.2). □

Proof of Theorem 2.1. Suppose $S = \{s_i : i \in \mathbb{N}\}$ and for $m \in \mathbb{N}$ define
$S_m = \{s_k \in S : k \le m\}$. For any probability measure $P$ on $(S^n, \mathcal{F})$, define the
$m$-truncation of $P$ to be the following measure on $(S^n, \mathcal{F})$: for $x \in S^n$,

(6.3)  $P^{(m)}(x) = 1_{\{x\in S_m^n\}}P(x) + 1_{\{x=(s_m,s_m,\dots,s_m)\}}P(S^n \setminus S_m^n).$

Since by construction $P^{(m)}$ also defines a probability measure on $S_m^n$ and
$S_m$ is a finite set, it follows from Section 5 [specifically, inequality (5.1) and
Theorem 5.1] that for any 1-Lipschitz function $\varphi$ and $1 \le i \le n$,

(6.4)  $\|V_i^{(m)}(\varphi)\|_\infty \le 1 + \sum_{j=i+1}^{n}\bar\eta_{ij}^{(m)},$

where $V_i^{(m)}(\varphi)$ and $\{\bar\eta_{ij}^{(m)}\}$ are defined in the obvious fashion.

On the other hand, it is easy to see that the sequence $P^{(m)}$ converges to
$P$ in total variation norm. Indeed,

(6.5)  $\|P - P^{(m)}\|_{TV} \le \sum_{z\in S^n\setminus S_m^n}P(z),$

and the r.h.s. must tend to zero, as $m \to \infty$, being the tail of a convergent
sum. Theorem 2.1 then follows by taking limits as $m \to \infty$ in (6.4) and
applying Lemma 6.1. □
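The truncation (6.3) and the bound (6.5) can be illustrated numerically. The Python sketch below is only an illustration under simplifying assumptions: a finite list `states` stands in for the countable space $S$, the measure `P` is a hypothetical product of geometric-type weights on $S^2$, and the helper names `truncate` and `tv` are ours, not the paper's.

```python
from itertools import product

def truncate(P, states, m, n):
    """m-truncation in the spirit of (6.3): keep the mass on S_m^n and move
    the remaining mass to the single point (s_m, ..., s_m)."""
    S_m = set(states[:m])
    corner = (states[m - 1],) * n
    Pm = {x: p for x, p in P.items() if all(c in S_m for c in x)}
    outside = sum(p for x, p in P.items() if not all(c in S_m for c in x))
    Pm[corner] = Pm.get(corner, 0.0) + outside
    return Pm, outside

def tv(P, Q):
    """Total variation distance, computed as half the l1 distance."""
    keys = set(P) | set(Q)
    return 0.5 * sum(abs(P.get(x, 0.0) - Q.get(x, 0.0)) for x in keys)

# Hypothetical example: finite stand-in for S, product measure on S^2
states = list(range(8))
n = 2
raw = [2.0 ** -(k + 1) for k in states]
weights = [r / sum(raw) for r in raw]
P = {x: weights[x[0]] * weights[x[1]] for x in product(states, repeat=n)}

report = []
for m in range(1, len(states) + 1):
    Pm, outside = truncate(P, states, m, n)
    report.append((m, tv(P, Pm), outside))  # (m, TV distance, mass outside S_m^n)
```

In this example the total variation distance never exceeds the mass of $S^n \setminus S_m^n$, and that mass decreases as $m$ grows, which is the behaviour the limiting argument relies on.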



7. Applications of the main result. 



7.1. Bounding $H_{n,i}$ for Markov chains. Given a countable set $S$, let $P$
be a (possibly inhomogeneous) Markov measure on $(S^n, \mathcal{F})$ with transition
kernels $p_k(\cdot \mid \cdot)$, $1 \le k \le n - 1$, as defined in (1.6), and let $\theta_i$ be the $i$th
contraction coefficient of the Markov chain, as defined in (1.7). The main
result of this section is Lemma 7.1, which shows that Theorem 1.2 follows
from Theorem 1.1 by establishing the bound (7.3).

For $1 \le k < n$, let $P^{(k)}$ be the $S \times S$ transition probability matrix associated
with the $k$th step of the Markov chain: for $1 \le k \le n - 1$,

$P^{(k)}_{ij} = p_k(j \mid i)$  for $i, j \in S$.

Then, using (1.18), the contraction coefficients $\theta_k$ can be rewritten as

(7.1)  $\theta_k = \frac12\sup_{i,i'\in S}\sum_{j\in S}\big|P^{(k)}_{ij} - P^{(k)}_{i'j}\big|.$

It is a well-known fact that if $\|u\|_1 < \infty$ and $\sum_{i\in S}u_i = 0$ and $P$ is a transition
probability matrix, then

(7.2)  $\|u^TP\|_1 \le \theta_P\|u\|_1,$

with $\theta_P$ defined as in the right-hand side of (7.1), but with $P^{(k)}$ replaced by
$P$ (for completeness, this fact is included as Lemma A.2 of the Appendix)
and where $u^T$ denotes the transpose of $u$.

Lemma 7.1. Let $S$ be a countable set, $P$ a Markov measure on $(S^n, \mathcal{F})$
with transition matrices $\{P^{(k)}\}$, $1 \le k \le n - 1$, and let $\{\bar\eta_{ij}\}$ and $\{\theta_i\}$ be
defined by (1.1) and (7.1), respectively. Then for $1 \le i < j \le n$, we have

(7.3)  $\bar\eta_{ij} \le \theta_i\theta_{i+1}\cdots\theta_{j-1}$

and so

$\|\Delta_n\|_\infty \le M_n,$

where $\|\Delta_n\|_\infty$ and $M_n$ are given by (1.3) and (1.8), respectively.

Proof. Let $(X_j)_{1\le j\le n}$ be the coordinate projections on $(S^n, \mathcal{F}, P)$, that
define a Markov chain of length $n$. Fix $1 \le i < n$, $y^{i-1} \in S^{i-1}$ and $w, \hat w \in S$.
Using the relation (1.18) and the definition (1.2) of $\eta_{ij}(y^{i-1}, w, \hat w)$, we see
that

$\eta_{ij}(y^{i-1}, w, \hat w)$
$\quad = \|\mathcal{L}(X_j^n \mid X^i = y^{i-1}w) - \mathcal{L}(X_j^n \mid X^i = y^{i-1}\hat w)\|_{TV}$
$\quad = \frac12\sum_{x_j^n}\big|P\{X_j^n = x_j^n \mid X^i = y^{i-1}w\} - P\{X_j^n = x_j^n \mid X^i = y^{i-1}\hat w\}\big|.$

However, by the Markov property of $P$, for any $x_j^n \in S^{n-j+1}$ and $z \in S^i$,

$P\{X_j^n = x_j^n \mid X^i = z\} = P\{X_{j+1}^n = x_{j+1}^n \mid X_j = x_j\}\,P\{X_j = x_j \mid X_i = z_i\}.$

Since $P\{X_{j+1}^n = x_{j+1}^n \mid X_j = x_j\} \le 1$, we conclude that for $j > i$,

$\eta_{ij}(y^{i-1}, w, \hat w)$
$\quad \le \frac12\sum_{x_j\in S}\big|P\{X_j = x_j \mid X_i = w\} - P\{X_j = x_j \mid X_i = \hat w\}\big|$
$\quad = \frac12\sum_{x_j\in S}\big|(e^{(w)} - e^{(\hat w)})^TP^{(i)}P^{(i+1)}\cdots P^{(j-1)}e^{(x_j)}\big|$
$\quad = \frac12\big\|(e^{(w)} - e^{(\hat w)})^TP^{(i)}P^{(i+1)}\cdots P^{(j-1)}\big\|_1,$

where, for $x \in S$, $e^{(x)} \in \mathbb{R}^S$ is the unit vector along the $x$ coordinate, that
is, $e^{(x)}_x = 1$ and $e^{(x)}_y = 0$ for all $y \in S$, $y \ne x$. Since $\|e^{(w)} - e^{(\hat w)}\|_1 \le 2$
and the fact that $P^{(k)}$ are transition matrices ensures that
$\sum_{i\in S}[((e^{(w)})^TP^{(k)})_i - ((e^{(\hat w)})^TP^{(k)})_i] = 0$ for all $k > 0$, a repeated application of
property (7.2) then yields the inequality

$\eta_{ij}(y^{i-1}, w, \hat w) \le \prod_{k=i}^{j-1}\theta_k.$

The bound (7.3) follows by taking the supremum of the l.h.s. over all
$y^{i-1} \in S^{i-1}$ and $w, \hat w \in S$. The second bound is a trivial consequence of the first.
□ 
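The bound (7.3) can be checked by brute force on a small chain. The Python sketch below is illustrative only: the two-state initial law `p0` and the kernels are hypothetical, and the helpers (`contraction`, `path_prob`, `tail_law`, `eta_bar`) are our own names. It enumerates all paths, forms the conditional laws of $(X_j,\dots,X_n)$ given each prefix of positive probability, and compares the largest total variation distance with the product of contraction coefficients.

```python
from itertools import product

def contraction(P):
    """theta = (1/2) max_{a,b} sum_j |P[a][j] - P[b][j]|, as in (7.1)."""
    m = len(P)
    return 0.5 * max(sum(abs(P[a][j] - P[b][j]) for j in range(m))
                     for a in range(m) for b in range(m))

def path_prob(p0, kernels, x):
    """Probability of the full path x under the inhomogeneous Markov measure."""
    pr = p0[x[0]]
    for k in range(len(x) - 1):
        pr *= kernels[k][x[k]][x[k + 1]]
    return pr

def tail_law(p0, kernels, prefix, j):
    """Law of (X_j, ..., X_n) given X_1..X_i = prefix (1-based, i = len(prefix) < j)."""
    n, m, i = len(kernels) + 1, len(p0), len(prefix)
    law, total = {}, 0.0
    for rest in product(range(m), repeat=n - i):
        pr = path_prob(p0, kernels, prefix + rest)
        total += pr
        tail = rest[j - i - 1:]
        law[tail] = law.get(tail, 0.0) + pr
    if total == 0.0:
        return None  # conditioning event has probability zero
    return {t: v / total for t, v in law.items()}

def eta_bar(p0, kernels, i, j):
    """Brute-force sup over (y^{i-1}, w, w_hat) of the TV distance between tail laws."""
    m, best = len(p0), 0.0
    for y in product(range(m), repeat=i - 1):
        laws = [tail_law(p0, kernels, y + (w,), j) for w in range(m)]
        laws = [law for law in laws if law is not None]
        for a in laws:
            for b in laws:
                best = max(best, 0.5 * sum(abs(a[t] - b[t]) for t in a))
    return best

# Hypothetical two-state chain of length n = 3
p0 = [0.5, 0.5]
kernels = [[[0.9, 0.1], [0.2, 0.8]],   # p_1
           [[0.5, 0.5], [0.3, 0.7]]]   # p_2

def theta_product(i, j):
    out = 1.0
    for k in range(i, j):               # theta_i * ... * theta_{j-1}
        out *= contraction(kernels[k - 1])
    return out

pairs = [(1, 2), (1, 3), (2, 3)]
results = {(i, j): (eta_bar(p0, kernels, i, j), theta_product(i, j)) for i, j in pairs}
```

For a two-state chain the two sides happen to coincide; with more states the product of contraction coefficients is in general only an upper bound.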

7.2. Bounding $H_{n,i}$ for hidden Markov chains. In this section we apply
the apparatus developed above to hidden Markov chains. Roughly speaking,
the distribution of a hidden Markov chain with finite state space $S$ (the
so-called "observed" state space) is governed by an underlying Markov chain on
a "hidden" state space $\hat S$ and a family of stochastic kernels
$q_\ell(\cdot \mid \cdot) : S \times \hat S \to [0, 1]$, which quantify the probability of observing the state
$x_\ell \in S$, given that the Markov chain is in the (hidden) state $\hat s_\ell$ at the $\ell$th step. A rigorous
definition is as follows. Given transition kernels $p_i(\cdot \mid \cdot)$, $i = 1, \dots, n$, on $\hat S$, let
$\mu$ be the associated Markov measure on $\hat S^n$: in other words, for $\hat x \in \hat S^n$, we
have

$\mu(\hat x) = p_0(\hat x_1)\prod_{k=1}^{n-1}p_k(\hat x_{k+1} \mid \hat x_k).$

Let $\nu$ be the probability measure on $(S \times \hat S)^n$, equipped with the $\sigma$-algebra
of all subsets, defined by

(7.4)  $\nu(x, \hat x) = \mu(\hat x)\prod_{\ell=1}^{n}q_\ell(x_\ell \mid \hat x_\ell),$

where $q_\ell(\cdot \mid \hat s)$ is a probability measure on $S$ for each $\hat s \in \hat S$ and $1 \le \ell \le n$.
It is easy to see that $\nu$ is a Markov measure on $(S \times \hat S)^n$. Indeed, if
$Z = ((X_i, \hat X_i), 1 \le i \le n)$ is a random variable defined on some probability
space $(\Omega, P)$ taking values in $(S \times \hat S)^n$ with distribution $\nu$, then the above
construction shows that for any $(x, \hat x) \in S \times \hat S$ and $(y_1^i, \hat y_1^i) \in (S \times \hat S)^i$,

$P\{(X_{i+1}, \hat X_{i+1}) = (x, \hat x) \mid (X_1^i, \hat X_1^i) = (y_1^i, \hat y_1^i)\}$
$\quad = p_i(\hat x \mid \hat y_i)\,q_{i+1}(x \mid \hat x)$
$\quad = P\{(X_{i+1}, \hat X_{i+1}) = (x, \hat x) \mid (X_i, \hat X_i) = (y_i, \hat y_i)\}.$

The hidden Markov chain measure is then defined to be the $S^n$-marginal
$\rho$ of the distribution $\nu$:

(7.5)  $\rho(x) = P\{X = x\} = \sum_{\hat x\in\hat S^n}\nu(x, \hat x).$

The random process $(X_i)_{1\le i\le n}$ (or measure $\rho$) on $S^n$ is called a hidden
Markov chain (resp., measure); it is well known that $(X_i)$ need not be
Markov to any order. We will refer to $(\hat X_i)$ as the underlying chain, which
is Markov by construction.

Theorem 7.1. Let $(X_i)_{1\le i\le n}$ be a hidden Markov chain, whose underlying
chain $(\hat X_i)_{1\le i\le n}$ is defined by the transition kernels $p_i(\cdot \mid \cdot)$. Define the
$k$th contraction coefficient $\theta_k$ of the underlying chain by

(7.6)  $\theta_k = \sup_{\hat x,\hat x'\in\hat S}\|p_k(\cdot \mid \hat x) - p_k(\cdot \mid \hat x')\|_{TV}.$

Then the mixing coefficients $\bar\eta_{ij}$ associated with the hidden Markov chain $X$
satisfy

(7.7)  $\bar\eta_{ij} \le \theta_i\theta_{i+1}\cdots\theta_{j-1},$

for $1 \le i < j \le n$.

The proof of Theorem 7.1 is quite straightforward, basically involving
a careful bookkeeping of summation indices, rearrangement of sums, and
probabilities marginalizing to 1. As in the ordinary Markov case in Section
7.1, the Markov contraction lemma (Lemma A.2) plays a central role.

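Proof of Theorem 7.1. For $1 \le i < j \le n$, $y_1^{i-1} \in S^{i-1}$ and $w_i, w_i' \in S$,
by (1.18) we have

$\eta_{ij}(y_1^{i-1}, w_i, w_i') = \frac12\sum_{x_j^n}\big|P\{X_j^n = x_j^n \mid X_1^i = [y_1^{i-1}w_i]\} - P\{X_j^n = x_j^n \mid X_1^i = [y_1^{i-1}w_i']\}\big|.$

Expanding the first conditional probability above, we obtain

$P\{X_j^n = x_j^n \mid X_1^i = [y_1^{i-1}w_i]\}$
$\quad = \sum_{\hat s_1^i}\sum_{\hat s_j^n}P\{X_j^n = x_j^n,\ (\hat X_1^i, \hat X_j^n) = (\hat s_1^i, \hat s_j^n) \mid X_1^i = [y_1^{i-1}w_i]\}$
$\quad = \sum_{\hat s_1^i}\sum_{\hat s_j^n}\frac{P\{(\hat X_1^i, \hat X_j^n) = (\hat s_1^i, \hat s_j^n)\}}{P\{X_1^i = [y_1^{i-1}w_i]\}}\,P\{(X_1^i, X_j^n) = ([y_1^{i-1}w_i], x_j^n) \mid (\hat X_1^i, \hat X_j^n) = (\hat s_1^i, \hat s_j^n)\},$

which can be further simplified using the fact that, by the definition of $P$,

$P\{(X_1^i, X_j^n) = ([y_1^{i-1}w_i], x_j^n) \mid (\hat X_1^i, \hat X_j^n) = (\hat s_1^i, \hat s_j^n)\}$
$\quad = \nu(x_j^n \mid \hat s_j^n)\,\nu(y_1^{i-1} \mid \hat s_1^{i-1})\,q_i(w_i \mid \hat s_i)$

and that, due to the Markov property of $\hat X$ and the definition of $\mu$,

$P\{(\hat X_1^i, \hat X_j^n) = (\hat s_1^i, \hat s_j^n)\} = \mu(\hat s_1^i)\,\mu(\hat s_j \mid \hat s_i)\,\mu(\hat s_{j+1}^n \mid \hat s_j).$

Expanding $P\{X_j^n = x_j^n \mid X_1^i = [y_1^{i-1}w_i']\}$ in a similar way, we then have

$\eta_{ij}(y_1^{i-1}, w_i, w_i') = \frac12\sum_{x_j^n}\Big|\sum_{\hat s_1^i}\sum_{\hat s_j^n}\mu(\hat s_1^i)\,\mu(\hat s_j \mid \hat s_i)\,\mu(\hat s_{j+1}^n \mid \hat s_j)\,\nu(x_j^n \mid \hat s_j^n)\,\nu(y_1^{i-1} \mid \hat s_1^{i-1})\,\delta(\hat s_i)\Big|,$

where we set (recalling that $\rho([y_1^{i-1}w]) = P\{X_1^i = [y_1^{i-1}w]\}$ for all $w \in S$)

$\delta(\hat s_i) = \frac{q_i(w_i \mid \hat s_i)}{\rho([y_1^{i-1}w_i])} - \frac{q_i(w_i' \mid \hat s_i)}{\rho([y_1^{i-1}w_i'])}.$

Since $|\sum_{ij}a_ib_j| \le \sum_i a_i|\sum_j b_j|$ for $a_i \ge 0$, $b_j \in \mathbb{R}$, taking the summation over
$\hat s_j^n$ and the term $\mu(\hat s_{j+1}^n \mid \hat s_j)\nu(x_j^n \mid \hat s_j^n)$ outside the absolute value on the
right-hand side, interchanging summations over $x_j^n$ and $\hat s_j^n$ and then using the fact
that $\sum_{\hat s_{j+1}^n}\mu(\hat s_{j+1}^n \mid \hat s_j)\sum_{x_j^n}\nu(x_j^n \mid \hat s_j^n) = 1$ for every $\hat s_j$, we obtain

(7.8)  $\eta_{ij}(y_1^{i-1}, w_i, w_i') \le \frac12\sum_{\hat s_j}\Big|\sum_{\hat s_1^i}\mu(\hat s_1^i)\,\mu(\hat s_j \mid \hat s_i)\,\nu(y_1^{i-1} \mid \hat s_1^{i-1})\,\delta(\hat s_i)\Big| = \frac12\sum_{\hat s_j}\Big|\sum_{\hat s_i}\mu(\hat s_j \mid \hat s_i)\,h_{\hat s_i}\Big|,$

where $h \in \mathbb{R}^{\hat S}$ is the vector defined by

(7.9)  $h_v = \delta(v)\sum_{\hat s_1^{i-1}}\mu([\hat s_1^{i-1}v])\,\nu(y_1^{i-1} \mid \hat s_1^{i-1}).$

Let $A^{(i,j)} \in [0, 1]^{\hat S\times\hat S}$ be the matrix with $A^{(i,j)}_{s,s'} = P(\hat X_j = s' \mid \hat X_i = s) = \mu(s' \mid s)$
for $s, s' \in \hat S$. Then the bound (7.8) can be recast in the form

$\eta_{ij}(y_1^{i-1}, w_i, w_i') \le \frac12\sum_{s'\in\hat S}\big|(h^TA^{(i,j)})_{s'}\big| = \frac12\|h^TA^{(i,j)}\|_1.$

Since $A^{(i,j)}$ is simply the transition matrix of the Markov chain $\hat X$ from step
$i$ to $j$, the contraction coefficient of $A^{(i,j)}$ is clearly bounded by $\prod_{k=i}^{j-1}\theta_k$.
Therefore, to prove the theorem, it suffices to verify that the assumptions

(7.10)  $\sum_{v\in\hat S}h_v = 0$  and  $\frac12\|h\|_1 \le 1$

of the contraction Lemma A.2 are satisfied. Now, expanding (7.9), we have

$h_v = \frac{q_i(w_i \mid v)}{\rho([y_1^{i-1}w_i])}\sum_{\hat s_1^{i-1}}\mu([\hat s_1^{i-1}v])\,\nu(y_1^{i-1} \mid \hat s_1^{i-1}) - \frac{q_i(w_i' \mid v)}{\rho([y_1^{i-1}w_i'])}\sum_{\hat s_1^{i-1}}\mu([\hat s_1^{i-1}v])\,\nu(y_1^{i-1} \mid \hat s_1^{i-1}).$

Summing the first term over $v$, and using (7.4) and (7.5), we obtain

$\frac{1}{\rho([y_1^{i-1}w_i])}\sum_{\hat s_1^{i-1}}\nu(y_1^{i-1} \mid \hat s_1^{i-1})\,P\{(\hat X_1^{i-1}, X_i) = (\hat s_1^{i-1}, w_i)\} = 1.$

An analogous identity holds for the summation over $v$ of the second term,
which proves (7.10) and, hence, the theorem. □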



Observe that the $\eta$-mixing coefficients of a hidden Markov chain are
bounded by the contraction coefficients of the underlying Markov one. One
might thus be tempted to pronounce Theorem 7.1 as "obvious" in retrospect,
based on the intuition that, conditioned on the hidden Markov chain $\hat X$, the
observed process $(X_i)_{1\le i\le n}$ is a sequence of independent random variables.
Thus, the reasoning might go, all the dependence structure is contained in
$\hat X_i$, and it is not surprising that the underlying process alone suffices to
bound $\bar\eta_{ij}$, which, after all, is a measure of the dependence in the process.
Such an intuition, however, would be wrong, as it fails to carry over to the
case where the underlying process is not Markov. A numerical example of
such an occurrence is given in Kontorovich's Ph.D. thesis [15] and, prior
to that, in [13], which is also where Theorem 7.1 was first proved. These
techniques have been extended further to prove concentration for Markov
tree processes; see [14] or [15].

7.3. Tightness of martingale difference bound for Markov measures. Given
a probability measure $P$ on $(S^n, \mathcal{F})$, from (3.2), we know that the associated
martingale differences $\{V_i(\varphi)\}$ satisfy

(7.11)  $\|V_i(\varphi)\|_\infty = \max_{z^i\in S^i}|V_i(\varphi; z^i)|,$

where for $1 \le i \le n$ and $z^i \in S^i$,

$V_i(\varphi; z^i) = E[\varphi(X) \mid X^i = z^i] - E[\varphi(X) \mid X^{i-1} = z^{i-1}].$

Just as $V_i(\cdot\,; y^{i-1}, w, \hat w)$ could be expressed as an inner product in (3.7) and
(3.8), $V_i(\cdot\,; z^i)$, being a linear functional on $K_n$, also admits a representation
in terms of an inner product. Indeed, for $z^i \in S^i$, we have

$V_i(\varphi; z^i) = \langle\kappa[z^i], \varphi\rangle,$

where $\kappa = \kappa[z^i] \in K_n$ has the form

(7.12)  $\kappa(x) = 1_{\{x^i=z^i\}}P(x_{i+1}^n \mid z^i) - 1_{\{x^{i-1}=z^{i-1}\}}P(x_i^n \mid z^{i-1})$

for $x \in S^n$. When combined with the definition of the norm $\|\cdot\|_\Phi$ and
Theorem 4.1, this shows that

$\max_{\varphi\in\Phi_n}\|V_i(\varphi)\|_\infty = \max_{z^i\in S^i}\max_{\varphi\in\Phi_n}|\langle\kappa[z^i], \varphi\rangle| = \max_{z^i\in S^i}\|\kappa[z^i]\|_\Phi \le \max_{z^i\in S^i}\|\kappa[z^i]\|_\Psi.$

It is of interest to ask whether this martingale difference bound is tight,
and if so, whether it is possible to obtain a simple description of a class
of extremal functions $\varphi$ for which the right-hand side is attained. In this
section, we identify such a class when $P$ is a Markov measure.

The main result is encapsulated in Theorem 7.5, whose statement requires 
the definition of the BAR class of extremal functions. 

Definition 7.2. A function $\varphi \in \Phi_n$ is said to admit a binary additive
representation if there exist functions $\mu_\ell : S \to \{0, 1\}$, $\ell = 1, \dots, n$, such that
for every $x \in S^n$,

(7.13)  $\varphi(x) = \sum_{\ell=1}^{n}\mu_\ell(x_\ell).$

In this case, we call $\varphi$ a BAR function and let $\tilde\Phi_n$ denote the collection of
BAR functions in $\Phi_n$.

Remark 7.3. Observe that any map $S^n \to \mathbb{R}$ of the form (7.13) is
1-Lipschitz and has range in $[0, n]$. Since $\Phi_n$ is an uncountable set while $\tilde\Phi_n$ is
finite, we trivially have $\tilde\Phi_n \subsetneq \Phi_n$. To get a meaningful size comparison, let
us examine the integer-valued members of $\Phi_n$, namely $\Phi_n \cap \mathbb{N}^{S^n}$.
For $|S| \ge 2$, a crude lower bound on the cardinality of $\Phi_n \cap \mathbb{N}^{S^n}$ is

$|\Phi_n \cap \mathbb{N}^{S^n}| > 2^{|S|^n}.$

On the other hand, the cardinality of $\tilde\Phi_n$ is easy to compute exactly:

$|\tilde\Phi_n| = 2^{n|S|}.$

Thus the vast majority of $\varphi \in \Phi_n \cap \mathbb{N}^{S^n}$ are not BAR functions.
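The first observation of Remark 7.3 is easy to confirm by enumeration. The Python sketch below is an illustration under assumptions of our own: a hypothetical three-point state space, $n = 2$, and helper names `bar_function` and `is_one_lipschitz`. It builds every function of the form (7.13) and checks that each is 1-Lipschitz in the Hamming metric with range in $[0, n]$.

```python
from itertools import product

def bar_function(mus):
    """phi(x) = sum_l mu_l(x_l), where mus[l] maps each state to 0 or 1, cf. (7.13)."""
    return lambda x: sum(mu[c] for mu, c in zip(mus, x))

def is_one_lipschitz(phi, points):
    """Check |phi(x) - phi(y)| <= Hamming distance for all pairs of points."""
    def hamming(x, y):
        return sum(a != b for a, b in zip(x, y))
    return all(abs(phi(x) - phi(y)) <= hamming(x, y) for x in points for y in points)

states, n = (0, 1, 2), 2          # hypothetical small example
points = list(product(states, repeat=n))
# every map S -> {0, 1}, stored as a dict from state to bit
all_mu = [dict(zip(states, bits)) for bits in product((0, 1), repeat=len(states))]

checked = 0
for mus in product(all_mu, repeat=n):
    phi = bar_function(mus)
    assert is_one_lipschitz(phi, points)
    assert all(0 <= phi(x) <= n for x in points)
    checked += 1
# 'checked' counts the choices of (mu_1, ..., mu_n), i.e. 2 ** (n * |S|) tuples
```

Changing one coordinate of $x$ changes exactly one summand $\mu_\ell(x_\ell)$ by at most 1, which is the reason the Lipschitz check never fails.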

We first begin with a lemma that shows that the norms $\|\cdot\|_\Phi$ and
$\|\cdot\|_\Psi$ coincide on the subset of so-called Markov-induced functions in $K_n$.
In order to state the lemma, we need to introduce some notation. Fix an
(inhomogeneous) Markov measure $P$ on $(S^n, \mathcal{F})$ and $z^i \in S^i$, and let $p_0(\cdot)$ and
$\{p_k(\cdot \mid \cdot) : 1 \le k < n\}$ be the associated initial measure and transition kernels,
respectively. In this case, $\kappa = \kappa[z^i]$ in (7.12) can be rewritten as

(7.14)  $\kappa(x) = \sigma(x_i)\,1_{\{x^{i-1}=z^{i-1}\}}\prod_{k=i}^{n-1}p_k(x_{k+1} \mid x_k)$

for $x \in S^n$, where $\sigma = \sigma[z^i] \in K_1$ is the real-valued function on $S$ defined by

(7.15)  $\sigma(y) = 1_{\{y=z_i\}} - p_{i-1}(y \mid z_{i-1}).$

In the case $i = 1$, by our conventions, the above relations reduce to the
following:

(7.16)  $\kappa(x) = \sigma(x_1)\prod_{k=1}^{n-1}p_k(x_{k+1} \mid x_k)$

for $x \in S^n$, where $\sigma = \sigma[z] \in K_1$ is the function on $S$ given by

$\sigma(y) = 1_{\{y=z\}} - p_0(y).$

For any $n \in \mathbb{N}$ and $\kappa \in K_n$, we say that $\kappa$ is Markov-induced if it has the
form (7.14), for some collection of transition kernels $\{p_k, 1 \le k < n\}$ with
$p_k(z \mid y) > 0$ for all $1 \le k < n$ and $z, y \in S$ and function $\sigma \in K_1$.

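Lemma 7.4. For any Markov-induced $\kappa \in K_n$, there exists a BAR function
$\tilde\varphi \in \tilde\Phi_n$ such that

$\langle\kappa, \tilde\varphi\rangle = \Psi_n(\kappa),$

and so

$|\langle\kappa, \tilde\varphi\rangle| = \|\kappa\|_\Psi.$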

Proof. We shall first prove this result for the case when $i = 1$. In this
case, $\kappa$ takes the form (7.16) and satisfies the key property that for $x \in S^n$,

(7.17)  $\operatorname{sgn}(\kappa(x)) = \operatorname{sgn}(\sigma(x_1)),$

meaning that $\operatorname{sgn}(\kappa(x))$ is a function of $x_1$ only. Thus we refer to $\sigma$ as the
sign function of $\kappa$.

We first claim that for any $\ell \in \mathbb{N}$, if $\kappa^{(\ell)} \in K_\ell$ is of the form (7.16) with
some sign function $\sigma^{(\ell)} \in K_1$, then $(\kappa^{(\ell)})' \in K_{\ell-1}$ is Markov-induced with
sign function $\sigma^{(\ell-1)}$ given by

(7.18)  $\sigma^{(\ell-1)}(z) = \sum_{x\in S}\sigma^{(\ell)}(x)\,p_1(z \mid x)$  for $z \in S$.

[Here, $(') : K_\ell \to K_{\ell-1}$ is the marginal projection operator defined in
Section 4.] This is readily verified by observing that for $z \in S^{\ell-1}$,

$(\kappa^{(\ell)})'(z) = \sum_{x\in S}\kappa^{(\ell)}(xz)$
$\quad = \sum_{x\in S}\sigma^{(\ell)}(x)\,p_1(z_1 \mid x)\prod_{k=2}^{\ell-1}p_k(z_k \mid z_{k-1})$
$\quad = \Big(\prod_{k=2}^{\ell-1}p_k(z_k \mid z_{k-1})\Big)\sum_{x\in S}\sigma^{(\ell)}(x)\,p_1(z_1 \mid x)$
$\quad = \sigma^{(\ell-1)}(z_1)\prod_{k=1}^{\ell-2}p_{k+1}(z_{k+1} \mid z_k),$

which is of the form (7.16) with sign function $\sigma^{(\ell-1)}$.

Thus, given a Markov-induced $\kappa \in K_n$ with associated sign function
$\sigma \in K_1$, first define $\kappa^{(n)} = \kappa$, $\sigma^{(n)} = \sigma$ and, for $\ell = n, \dots, 2$, let $\kappa^{(\ell-1)} = (\kappa^{(\ell)})'$
and let $\sigma^{(\ell-1)} \in K_1$ be the sign function of $\kappa^{(\ell-1)}$. Then, for $\ell = 1, \dots, n$,
each $\kappa^{(\ell)} \in K_\ell$ satisfies

(7.19)  $\operatorname{sgn}(\kappa^{(\ell)}(x)) = \operatorname{sgn}(\sigma^{(\ell)}(x_1)).$

Next, construct the sequence of functions $\mu_1, \dots, \mu_n$ from the sequence
$\kappa^{(1)}, \kappa^{(2)}, \dots, \kappa^{(n)}$, with $\mu_\ell : S \to \{0, 1\}$ given by

(7.20)  $\mu_\ell(x) = 1_{\{\sigma^{(n-\ell+1)}(x)>0\}}.$

Then the function $\tilde\varphi : S^n \to \mathbb{R}$ defined by

(7.21)  $\tilde\varphi(x) = \sum_{\ell=1}^{n}\mu_\ell(x_\ell)$

for $x \in S^n$, is easily seen to belong to $\tilde\Phi_n$. Moreover, note that

$\langle\kappa, \tilde\varphi\rangle = \sum_{x\in S^n}\kappa(x)\tilde\varphi(x)$
$\quad = \sum_{x\in S^n}\mu_1(x_1)\kappa(x) + \sum_{x\in S^n}\mu_2(x_2)\kappa(x) + \cdots + \sum_{x\in S^n}\mu_n(x_n)\kappa(x)$
$\quad = \sum_{x_1^n}\mu_1(x_1)\kappa^{(n)}(x_1^n) + \sum_{x_2^n}\mu_2(x_2)\kappa^{(n-1)}(x_2^n) + \cdots + \sum_{x_n\in S}\mu_n(x_n)\kappa^{(1)}(x_n)$
$\quad = \sum_{x_1^n}1_{\{\sigma^{(n)}(x_1)>0\}}\kappa^{(n)}(x_1^n) + \sum_{x_2^n}1_{\{\sigma^{(n-1)}(x_2)>0\}}\kappa^{(n-1)}(x_2^n) + \cdots + \sum_{x_n}1_{\{\sigma^{(1)}(x_n)>0\}}\kappa^{(1)}(x_n)$
$\quad = \sum_{x\in S^n}(\kappa^{(n)}(x))_+ + \sum_{x\in S^{n-1}}(\kappa^{(n-1)}(x))_+ + \cdots + \sum_{x\in S}(\kappa^{(1)}(x))_+$
$\quad = \Psi_n(\kappa),$

where the second to last equality uses the property (7.19) and the last
equality follows from the definition of the operator $\Psi_n$. This completes the proof
of the first statement of the lemma. Due to the definition of the norm $\|\cdot\|_\Phi$,
the second statement is a simple consequence of the first.

The case of general $1 < i < n$ can be dealt with by a corresponding
extension of Lemma 7.4 from Markov-induced $\kappa$ of the form (7.16) to $\kappa$ of
the form (7.14), which can be achieved by the dimension-reducing technique
employed in Section 4. We omit the details. □
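The construction in the proof (for the case $i = 1$) can be traced numerically. The Python sketch below is a hedged illustration: the initial law `p0`, the kernels, and the helper names `psi` and `bar_extremal` are hypothetical choices of ours. It builds $\kappa$ of the form (7.16), propagates the sign function through the kernels as in (7.18), forms the BAR function (7.21), and compares $\langle\kappa,\tilde\varphi\rangle$ with $\Psi_n(\kappa)$.

```python
from itertools import product

def psi(kappa, n):
    """Psi_n(kappa) = sum of positive parts + Psi_{n-1}(kappa'), with Psi_0 = 0."""
    if n == 0:
        return 0.0
    marginal = {}
    for x, v in kappa.items():
        marginal[x[1:]] = marginal.get(x[1:], 0.0) + v  # sum out the first coordinate
    return sum(max(v, 0.0) for v in kappa.values()) + psi(marginal, n - 1)

def bar_extremal(p0, kernels, z):
    """Return (kappa, phi): kappa as in (7.16) with sigma(y) = 1{y=z} - p0(y),
    and the BAR function phi built from the signs of the propagated sigma."""
    m, n = len(p0), len(kernels) + 1
    sigma = [(1.0 if y == z else 0.0) - p0[y] for y in range(m)]
    kappa = {}
    for x in product(range(m), repeat=n):
        v = sigma[x[0]]
        for k in range(n - 1):
            v *= kernels[k][x[k]][x[k + 1]]
        kappa[x] = v
    mus, sig = [], sigma[:]
    for l in range(n):
        mus.append([1 if s > 0 else 0 for s in sig])   # mu_{l+1} = 1{sigma > 0}
        if l < n - 1:
            # propagate the sign function one step through the next kernel
            sig = [sum(sig[a] * kernels[l][a][b] for a in range(m)) for b in range(m)]
    phi = {x: sum(mus[l][x[l]] for l in range(n)) for x in kappa}
    return kappa, phi

# Hypothetical two-state example with strictly positive kernels, n = 3
p0 = [0.3, 0.7]
kernels = [[[0.9, 0.1], [0.2, 0.8]],
           [[0.5, 0.5], [0.3, 0.7]]]
kappa, phi = bar_extremal(p0, kernels, z=0)
inner = sum(kappa[x] * phi[x] for x in kappa)
psi_value = psi(kappa, 3)
```

In this example the two quantities agree up to floating-point rounding, as the lemma predicts for Markov-induced $\kappa$.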

The last lemma immediately implies the following extremal property of 
BAR functions with respect to martingale differences of Markov measures. 

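Theorem 7.5. Given a finite set $S$ and a Markov measure $P$ with full
support on $(S^n, \mathcal{F})$ as defined in (1.6), for every $1 \le i \le n - 1$, there exists
a BAR function $\tilde\varphi \in \tilde\Phi_n$ such that

(7.22)  $\|V_i(\tilde\varphi)\|_\infty = \max_{\varphi\in\Phi_n}\|V_i(\varphi)\|_\infty = \max_{z^i\in S^i}\|\kappa[z^i]\|_\Phi = \max_{z^i\in S^i}\|\kappa[z^i]\|_\Psi.$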

Proof. Given any $\varphi \in \Phi_n$, let $z^i$ be the element of $S^i$ that achieves
the maximum in the right-hand side of (7.11). Then the discussion at the
beginning of the section, along with Lemma 7.4 and Theorem 4.1, shows
that (when $i = 1$) there exists a BAR function $\tilde\varphi$ such that

$|V_i(\tilde\varphi; z^i)| = |\langle\kappa[z^i], \tilde\varphi\rangle| = \|\kappa[z^i]\|_\Psi \ge \|\kappa[z^i]\|_\Phi \ge |\langle\kappa[z^i], \varphi\rangle| = \|V_i(\varphi)\|_\infty.$

Taking the maximum over $z^i \in S^i$, we conclude that

$\|V_i(\tilde\varphi)\|_\infty = \max_{z^i\in S^i}\|\kappa[z^i]\|_\Psi \ge \max_{z^i\in S^i}\|\kappa[z^i]\|_\Phi \ge \|V_i(\varphi)\|_\infty.$

Taking the maximum of the left-hand side over BAR functions $\tilde\varphi$ (and,
without loss of generality, denoting a maximizing function there again by
$\tilde\varphi$), and then taking the maximum over the right-hand side over functions
$\varphi \in \Phi_n$, the fact that $\tilde\Phi_n \subset \Phi_n$ shows that the inequalities can be replaced
by equalities and hence (7.22) follows. □



APPENDIX

A.1. The norms $\|\cdot\|_\Phi$ and $\|\cdot\|_\Psi$. The following result, while not directly
used in the paper, may be of independent interest.

Lemma A.1. For $n \ge 1$, both the functionals $\|\cdot\|_\Phi$ and $\|\cdot\|_\Psi$ described
in (3.9) and (4.3), respectively, define norms on $K_n$.

Proof. It follows trivially from the definition

$\|\kappa\|_\Phi = \max_{\varphi\in\Phi_n}|\langle\kappa, \varphi\rangle| = \max_{\varphi\in\Phi_n}\Big|\sum_{x\in S^n}\kappa(x)\varphi(x)\Big|$

that for $\kappa \in K_n$, $\|\kappa\|_\Phi \ge 0$, $\|\kappa\|_\Phi = 0$ if and only if $\kappa = 0$ (to see this, choose
$\varphi(x) = 1_{\{\kappa(x)>0\}}$) and $\|a\kappa\|_\Phi = |a|\,\|\kappa\|_\Phi$ for $a \in \mathbb{R}$. Last, the triangle inequality
$\|\kappa_1 + \kappa_2\|_\Phi \le \|\kappa_1\|_\Phi + \|\kappa_2\|_\Phi$ follows immediately from the linearity of
$\langle\cdot, \cdot\rangle$ in the first variable and the fact that $|\cdot|$ satisfies the triangle inequality.
This shows that $\|\cdot\|_\Phi$ defines a norm on $K_n$.

We now consider the functional

$\|\kappa\|_\Psi = \max\{\Psi_n(\kappa), \Psi_n(-\kappa)\}$

with the operator $\Psi_n$ defined recursively through the relation

$\Psi_n(\kappa) = \sum_{x\in S^n}(\kappa(x))_+ + \Psi_{n-1}(\kappa'),$

with $\kappa' \in K_{n-1}$ given by $\kappa'(y) = \sum_{x_1\in S}\kappa(x_1y)$. The fact that $\Psi_0 = 0$, along
with the above recursion relation, immediately guarantees that for all
$\ell = 0, 1, \dots$, and $\lambda \in K_\ell$, $\Psi_\ell(\lambda) \ge 0$ and $\Psi_\ell(\lambda) \ge \sum_{x\in S^\ell}(\lambda(x))_+$. If $\kappa(x) \ne 0$ for
some $x \in S^n$, then the latter quantity is strictly positive for either $\lambda = \kappa$ or
$\lambda = -\kappa$, which implies that $\|\kappa\|_\Psi = 0$ if and only if $\kappa = 0$. The homogeneity
property of the norm $\|\cdot\|_\Psi$ follows from the corresponding property for
the operator $\Psi_n$, namely, $\Psi_n(a\kappa) = a\Psi_n(\kappa)$ for $a > 0$. Last, the triangle
inequality is a consequence of the property $\Psi_n(\kappa_1 + \kappa_2) \le \Psi_n(\kappa_1) + \Psi_n(\kappa_2)$ for
every $\kappa_1, \kappa_2 \in K_n$, which can be deduced using the subadditivity of the
function $f(z) = (z)_+$, the fact that $\Psi_0$ trivially satisfies the triangle inequality
and induction. □

A.2. Contraction lemma. For completeness, we include the elementary
proof of a bound that was used in the proof of Lemma 7.1. For finite $S$, a
proof of this result goes back to Markov [18] (see Section 5 of that work,
or Lemma 10.6(h) of [3]). We recall that $u^T$ denotes the transpose of the
vector $u \in \mathbb{R}^S$.



Lemma A.2. Let $S$ be a countable set, $u \in \mathbb{R}^S$ be such that $\sum_{i\in S}u_i = 0$
and $\|u\|_1 < \infty$, and let $P$ be an $S \times S$ matrix such that $u^TP$ is well defined.
Then

(A.1)  $\|u^TP\|_1 \le \theta_P\|u\|_1,$

where $\theta_P$ is the contraction coefficient of $P$:

(A.2)  $\theta_P = \frac12\sup_{i,i'\in S}\sum_{j\in S}|P_{ij} - P_{i'j}|.$

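Proof. Let $y_i = |u_i|$, and define $I_+$, $I_-$ as follows: $I_+ = \{i \in S : u_i > 0\}$
and $I_- = \{i \in S : u_i < 0\}$. Then for any finite $J \subseteq S$,

$\sum_{j\in J}(u^TP)_j = \sum_{j\in J}\sum_{i\in S}u_iP_{ij} = \sum_{i\in I_+}Q_iy_i - \sum_{i\in I_-}Q_iy_i,$

where $Q_i = \sum_{j\in J}P_{ij}$. Thus, we obtain

$\Big|\sum_{j\in J}(u^TP)_j\Big| \le \Big|\sum_{i\in I_+}Q_iy_i - \sum_{i\in I_-}Q_iy_i\Big|$
$\quad \le \Big|\Big(\sup_{k\in I_+}Q_k\Big)\sum_{i\in I_+}y_i - \Big(\inf_{k\in I_-}Q_k\Big)\sum_{i\in I_-}y_i\Big|$
$\quad = \frac12\|u\|_1\Big|\sup_{k\in I_+}Q_k - \inf_{k\in I_-}Q_k\Big|$
$\quad \le \frac12\|u\|_1\sup_{i,i'\in S}\Big|\sum_{j\in J}(P_{ij} - P_{i'j})\Big|$
$\quad \le \theta_P\|u\|_1.$

Taking the supremum of the l.h.s. over all finite $J \subseteq S$ yields the result. □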
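The inequality (A.1) is easy to probe numerically. The Python sketch below is a sanity check under our own assumptions, not a substitute for the proof: it uses a finite state space, randomly generated row-stochastic matrices, and zero-sum vectors $u$. The helper names `theta` and `l1_after_step` are illustrative.

```python
import random

def theta(P):
    """Contraction coefficient (A.2): half the largest l1 distance between two rows."""
    m = len(P)
    return 0.5 * max(sum(abs(P[a][j] - P[b][j]) for j in range(m))
                     for a in range(m) for b in range(m))

def l1_after_step(u, P):
    """l1 norm of the row vector u^T P."""
    m = len(P)
    return sum(abs(sum(u[i] * P[i][j] for i in range(m))) for j in range(m))

rng = random.Random(0)   # fixed seed so the experiment is reproducible
m = 4
worst_gap = float("inf")
for trial in range(200):
    P = []
    for row_index in range(m):
        row = [rng.random() for _ in range(m)]
        total = sum(row)
        P.append([v / total for v in row])      # normalize to a stochastic row
    u = [rng.uniform(-1.0, 1.0) for _ in range(m)]
    mean = sum(u) / m
    u = [v - mean for v in u]                   # enforce sum(u) = 0
    gap = theta(P) * sum(abs(v) for v in u) - l1_after_step(u, P)
    worst_gap = min(worst_gap, gap)
# worst_gap is the smallest observed value of theta_P * ||u||_1 - ||u^T P||_1
```

A nonnegative `worst_gap` (up to rounding) is what (A.1) predicts; dropping the zero-sum normalization of `u` would in general break the inequality.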